Human resources are the most valuable asset in any country (Peretti, 2004). They are the main reason behind the success or the failure of any organization. In fact, having an educated and competent manpower is the key driver to economic and social development. In this context, the importance of academic education has become undeniable. Therefore, it is crucial to invest money and time in order to study students’ academic performance and figure out effective ways to improve it.
Given the importance of the topic, it has been given particular attention in past research. In fact, many studies have been conducted in order to analyze the factors impacting students’ academic performance. While some studies focused on the psychological variables, such as Franck Amadieu & André Tricot’s research (Amadieu and Tricot, 2022), other researchers have been interested in the impact of other elements such as mobility (Richard, 2009), gender and other socio-economic factors on students’ academic success.
Many reasons motivated us to choose this topic of research. In fact, as students, we are very passionate about the educational field. Thus, we want to provide through this project a detailed analysis that can be used as a reference guide for leaders working in the educational field. Mainly, we want to help schools and universities to have a better understanding of the factors influencing students’ academic performance in order to improve their decision-making processes, students’ success rate and eventually their overall organization.
The aim of the project is to understand the evolution of secondary academic performance in France. Our study will mainly focus on 3rd grade students (equivalent to 11th grade in Switzerland) and their results on the Diplôme National du Brevet (DNB) by school.
First, we will observe whether there are improvements or, on the contrary, deterioration in admissions of DNB over the years. From this dataset, we will also make comparisons, particularly at the geographical level, and an analysis of the success rate in terms of distinction for each school.
Then, we will try to understand if there is a correlation between academic success and some socio-economic factors, such as the type of accommodation, the single-parent families rate, and the involvement of schools in students’ physical and sports practice. Finally, despite these factors, we will investigate whether the COVID-19 pandemic has had a direct negative impact on students’ school performance.
What is the evolution of student performance over time and across the different regions/departments of France?
Do socio-economic factors such as the type of accommodation, family situation or college policies have an influence on student success ?
Has the COVID-19 pandemic impacted student performance?
This dataset presents the results of the “diplôme national du brevet” by school, for schools in metropolitan France and for the overseas departments and regions. This data set contains 139’580 observations.
| Variable | Meaning |
|---|---|
| session | Year of the exam session |
| school_id | School identification number |
| school_type | School type divided in six categories: COLLEGE, LYCEE PROFESSIONNEL, LYCEE, EREA, CFA, and AUTRE |
| establishment_name | Name of the establishment |
| education_sector | Education sector categorised as public or private |
| municipality_code | Municipality code |
| municipality | Name of the municipality |
| department_code | Department code. It is to be noted that France has 101 departements. |
| department | Name of the department |
| academy_code | Academy code |
| academy_name | Name of the academy |
| region_code | Region code. It is to be noted that France has 18 administrative, regions |
| region | Name of the region |
| registered | Registered candidates |
| present | Candidates present for the exam |
| admitted | Candidates admitted |
| admitted_without | Candidates admitted without distinction |
| admitted_AB | Candidates admitted with distinction “Assez Bien” |
| admitted_B | Candidates admitted with distinction “Bien” |
| admitted_TB | Candidates admitted with distinction “Très bien” |
| success_rate | “Success rate [Present]/[Admis] as a percentage” |
This data set gathers all schools which have been awarded the “Generation 2024” label. The objective of this label, developed in view of the Paris 2024 Olympic Games, is to develop bridges between the school world and the sports movement in order to encourage young people to take part in physical activity and sport. This data set contains 6’883 observations.
| Variable | Meaning |
|---|---|
| region | Name of the region |
| academy | Name of the academy |
| department | Name of the department |
| municipality | Name of the municipality |
| establishment | Name of the establishment |
| school_id | School identification number |
| school_type | School type |
| education_sector | Education sector categorised as public or private |
| postcode | Postcode |
| adress | Address of the establishment |
| adress_2 | Additional address of the establishment |
| E-mail address of the establishment | |
| students | Number of students in the school |
| priority_education | Indicates whether the school is located in a priority education network (REP) or a reinforced priority education network (REP+) |
| city_school | Indicates whether the school is part of a city school |
| QPV | Position relative to a priority neighbourhood of the city policy. It is a policy aimed at compensating for differences in living standards with the rest of the territory. |
| ULIS | “Indicates whether the school offers a ULIS (Localized Unit for School Inclusion)” |
| SEGPA | “Indicates whether the school has a SEPGA (adapted general and vocational education sections)” |
| sport_section | Indicates whether the school has a sports section |
| agricultural_high_school | Indicates whether the school is an agricultural high school |
| military_high_school | Indicates whether the school is a military high school |
| vocational_high_school | Indicates whether the establishment is labeled “vocational high school” |
| establishment_web | Url of the description of the establishment page on the ONISEP website |
| SIREN_SIRET | “SIREN/SIRET number of the establishment. SIREN is for Business Register Identification System in french.” |
| district | Name of the district to which the school is attached |
| ministry | Ministry responsible for the institution |
| label_start_date | Start date of the “generation 2024” label. Format yyyy/mm/dd |
| label_end_date | End date of the “generation 2024” label. Format yyyy/mm/dd |
| y_coordinate | Y coordinate of the establishment, using the EPSG coordinate system |
| x_coordinate | X coordinate of the establishment, using the EPSG coordinate system |
| epsg | EPSG code of the coordinate system used to locate the establishment |
| precision_on_localisation | Specification of the geographical location of the establishment |
| latitude | Latitude |
| longitude | Longitude |
| position | Geographical position |
| engaging_30_sport | Indicates whether the institution participates in the 30 minutes of daily physical activity programme |
This dataset records enrolment in secondary schools according to the type of accommodation for pupils: half-board, boarding school etc. This data set contains 32’096 observations.
| Variable | Meaning |
|---|---|
| year_back_to_school | Year of the start of the school year |
| region | Name of the academic region |
| academy | Name of the academy |
| department | Name of the department |
| municipality | Name of the municipality |
| number | School identification number |
| establishment_main_name | Main name of the establishment |
| establishment_name | Name of the establishment |
| school_type | School type |
| education_sector | Education sector categorised as public or private |
| students_secondary_education | Students in secondary education |
| students_higher_education | Number of students in higher education |
| external_students_secondary_education | External students in secondary education |
| half_boarders_students_secondary_education | Half-boarders in secondary education |
| boarding_students_secondary_education | Boarding students in secondary education |
| external_students_higher_education | External students in higher education |
| half_board_students_higher_education | Half-board students in higher education |
| boarding_students_higher_education | Boarding students in higher education |
This data set provides information about the single-parent families in each municipality. The census is made two years in a row every five years since 2007. This data set contains 606 observations.
| Variable | Meaning |
|---|---|
| geocode | Departmental code from INSEE |
| libgeo | Name of department |
| year | Census year |
| sing_par | Number of single-parent families |
This is a time based data set that gives us information on the COVID 19 tests and results carried out by laboratories, hospitals, pharmacists, doctors and nurses. It is updated daily. On the 30th November, the data set contained 82’394 observations.
| Variable | Meaning |
|---|---|
| department_code | Department code |
| test_week | Week of the tests. Format yyyy-Sww |
| educational_level | Description of the age group as [m-n], m and n being the lower and upper limits. |
| age_group | Denomination of the age group. n-1 is used in this case excepet for the oldest group where 18 is used |
| pop | Population |
| positive | Weekly patients testing positive |
| tested | Weekly patients tested |
| incidence_rate | Incidence rate |
| positivity_rate | Positivity rate |
| screening_rate | Screening rate |
Loading of the data All but the single_parent dataset are CSV files with semicolons as separators. The single_parent data set is in excel format, so we have to use read_excel. We used “skip = 4” because the document includes extra header information rows.
DNB_par_etablissement <- read_delim(here::here("data/DNB-par-etablissement.csv"), ";", escape_double = FALSE, trim_ws = TRUE)
Etablissements_labellises_generation_2024 <- read_delim(here::here("data/Etablissements-labellises-generation-2024.csv"),";", escape_double = FALSE, trim_ws = TRUE)
Hebergement_eleves_etablissements_2d <- read_delim(here::here("data/Hebergement-eleves-etablissements-2d.csv"), ";", escape_double = FALSE, trim_ws = TRUE)
insee_rp_hist_xxxx <- read_excel(here::here("data/insee_rp_hist_xxxx.xlsx"), skip = 4)
covid_sp_dep_heb_cage_scol_2022_11_30_19h01 <- read_delim(here::here("data/covid_sp_dep_heb_cage_scol_2022_11_30_19h01.csv"), ";", escape_double = FALSE, trim_ws = TRUE)We have realised that some wrangling are necessary for each data sets. We have established a checklist that we will go through for each data set. We have to :
rename_df <- function(df, x){
if (ncol(df) == length(x)){
names(df) <- c(x)
df <- as_tibble(df)
} else {
stop("Vector is not the right length")
}
}
dnb_colnames <- c("session", "school_id", "school_type", "establishment_name", "education_sector", "municipality_code", "municipality", "department_code", "department", "academy_code", "academy_name", "region_code", "region", "registered", "present", "admitted", "admitted_without", "admitted_AB", "admitted_B", "admitted_TB", "success_rate_pct"
)
dnb_results <- rename_df(DNB_par_etablissement, dnb_colnames)xx,xx% we want it as a
double of the form xx.xx
dnb_results[["success_rate_pct"]] <- as.double(gsub("%","",
gsub(",",".", dnb_results[["success_rate_pct"]])))department_fr and drop the
overseas collectivities (COM).
dnb_results$department_fr <- stri_trans_general(dnb_results$department, "Latin-ASCII") %>%
str_to_title(.) %>%
gsub("Du", "du", .) %>%
gsub("De", "de", .) %>%
gsub("D'", "D", .) %>%
gsub("Et", "et", .) %>%
gsub(" ", "-", .) %>%
str_replace_all("Corse-du-Sud", "Corse du Sud") %>%
str_replace_all("deux-Sevres", "Deux-Sevres") %>%
str_replace_all("Alpes-de-Hte-Provence", "Alpes-de-Haute-Provence") %>%
str_replace_all("Territoire-de-Belfort", "Territoire de Belfort") %>%
str_replace_all("Seine-Saint-denis", "Seine-Saint-Denis")
dnb_results <- dnb_results %>%
dplyr::filter(!department_fr %in% c("Polynesie-Française","Guyane", "Martinique", "Guadeloupe", "La-Reunion", "Mayotte", "NA", "-"))dnb_results <- dnb_results %>%
mutate(without_pct = admitted_without/admitted*100,
AB_pct = admitted_AB/admitted*100,
B_pct = admitted_B/admitted*100,
TB_pct = admitted_TB/admitted*100
)dnb_results_dep <- dnb_results %>%
select(session, department_code, region:TB_pct) %>%
group_by(department_fr, session, region) %>%
summarise(registered = sum(registered),
present = sum(present),
admitted = sum(admitted),
admitted_without = sum(admitted_without),
admitted_AB = sum(admitted_AB),
admitted_B = sum(admitted_B),
admitted_TB = sum(admitted_TB),
without_pct = mean(without_pct, na.rm = TRUE),
AB_pct = mean(AB_pct, na.rm = TRUE),
B_pct = mean(B_pct, na.rm = TRUE),
TB_pct = mean(TB_pct, na.rm = TRUE),
success_rate_pct = mean(success_rate_pct, na.rm = TRUE))We can see the summarized data set dnb_results_dep below.
est_24_names <- c("region", "academy", "department", "municipality", "establishment", "school_id", "school_type", "education_sector", "postcode", "adress", "adress_2", "mail", "students", "priority_education", "city_school", "QPV", "ULIS", "SEGPA", "sport_section", "agricultural_high_school", "military_high_school", "vocational_high_school", "establishment_web", "SIREN_SIRET", "district", "ministry", "label_start_date", "label_end_date", "y_coordinate", "x_coordinate", "epsg", "precision_on_localisation", "latitude", "longitude", "position", "engaging_30_sport")
establishment_24 <- rename_df(Etablissements_labellises_generation_2024, est_24_names)
establishment_24 <- establishment_24 %>%
mutate(session_started = case_when(month(label_start_date) <= 7 ~ year(label_start_date),
month(label_start_date) > 7 ~ year(label_start_date)+1),
session_ended = case_when(month(label_end_date) <= 7 ~ year(label_end_date),
month(label_end_date) > 7 ~ year(label_end_date)+1)
)department_fr.
establishment_24$department_fr <- stri_trans_general(establishment_24$department, "Latin-ASCII") %>%
str_to_title(.) %>%
gsub("Du", "du", .) %>%
gsub("De", "de", .) %>%
gsub("D'", "D", .) %>%
gsub("Et", "et", .) %>%
gsub(" ", "-", .) %>%
str_replace_all("Corse-du-Sud", "Corse du Sud") %>%
str_replace_all("deux-Sevres", "Deux-Sevres") %>%
str_replace_all("Territoire-de-Belfort", "Territoire de Belfort") %>%
str_replace_all("Seine-Saint-denis", "Seine-Saint-Denis")We can see on the map below, that the data set contains establishments from the overseas collectivities (COM) but from the French international schools as well.
As previsouly discussed we have decided to keep only data from mainland France. We had to make sure that we also removed the French international schools. We took the opportunity to remove unused variables.
establishment_24 <- establishment_24 %>%
dplyr::filter(!department_fr %in% c("Polynesie-Francaise","Guyane", "Martinique", "Guadeloupe", "La-Reunion", "Mayotte", "Saint-Martin", "-")) %>%
dplyr::filter(!department_fr == "NA")#"NA" and "-" makes sure that we have no more International schools.
#establishment_24 has a lot of variables which we will for sure not use
establishment_24 <- establishment_24 %>%
select(-c(postcode:mail,city_school,QPV:SEGPA,establishment_web:ministry, precision_on_localisation))The “Diplome National du Brevet” is the diploma received at the end of “collège”. We have to keep establishments which are “collège” and drop all the rest.
Other data sets only go down to the department level. We therefore have to create a simplified version for further merges between data sets. We will summarize by counting the number of establishment created each session by department. The end of the label period is not of interest for us as the first end date is on the 9.01.2021 which is categorised as session 2022 and we have results for dnb up until session 2021.
establishment_24_dep <- establishment_24 %>%
select(region, department_fr, session_started) %>%
group_by(department_fr, session_started) %>%
summarise(establishment = n())We can see the summarized data set establishment_24_dep below.
housing_names <- c("year_back_to_school", "region", "academy", "department", "municipality", "school_id", "establishment_main_name", "establishment_name", "school_type", "education_sector", "students_secondary_education", "students_higher_education", "external_students_secondary_education", "half_boarders_students_secondary_education", "boarding_students_secondary_education", "external_students_higher_education", "half_board_students_higher_education", "boarding_students_higher_education")
student_housing <- rename_df(Hebergement_eleves_etablissements_2d, housing_names)session variable as
year_back_to_school refers to the beginning of the school
year and not the exam session.
student_housing <- student_housing %>%
mutate(session = year_back_to_school + 1) %>%
select(year_back_to_school,session, everything()) #here just to order variablesdepartment_fr and remove the
departments outside mainland France.student_housing$department_fr <- stri_trans_general(student_housing$department, "Latin-ASCII") %>%
str_to_title(.) %>%
gsub("Du", "du", .) %>%
gsub("De", "de", .) %>%
gsub("D'", "D", .) %>%
gsub("Et", "et", .) %>%
gsub(" ", "-", .) %>%
str_replace_all("Corse-du-Sud", "Corse du Sud") %>%
str_replace_all("deux-Sevres", "Deux-Sevres") %>%
str_replace_all("Alpes-de-Hte-Provence", "Alpes-de-Haute-Provence") %>%
str_replace_all("Territoire-de-Belfort", "Territoire de Belfort") %>%
str_replace_all("Seine-Saint-denis", "Seine-Saint-Denis")
student_housing <- student_housing %>%
dplyr::filter(!department_fr %in% c("Polynesie-Francaise","Guyane", "Martinique", "Guadeloupe", "La-Reunion", "Mayotte", "Saint-Martin", "-")) %>%
dplyr::filter(!department_fr == "NA")student_housing <- student_housing %>%
select(-c(contains("higher")))student_housing <- student_housing %>%
mutate(external_students_rate = external_students_secondary_education/students_secondary_education*100,
half_boarders_students_rate = half_boarders_students_secondary_education/students_secondary_education*100,
boarding_students_rate = boarding_students_secondary_education/students_secondary_education*100)housing_dep <- student_housing %>%
select(session, region, students_secondary_education:boarding_students_rate) %>%
group_by(session, department_fr) %>%
summarise(external_students_secondary_education = sum(external_students_secondary_education, na.rm = TRUE),
half_boarders_students_secondary_education = sum(half_boarders_students_secondary_education, na.rm = TRUE),
boarding_students_secondary_education = sum(boarding_students_secondary_education, na.rm = TRUE),
students_secondary_education = sum(students_secondary_education, na.rm = TRUE),
external_students_rate = mean(external_students_rate, na.rm = TRUE),
half_boarders_students_rate = mean(half_boarders_students_rate, na.rm = TRUE),
boarding_students_rate = mean(boarding_students_rate, na.rm = TRUE))We can see the summarized data set housing_dep below.
sg_parent_names <- c("geocode", "department", "session","sing_par")
single_parent <- rename_df(insee_rp_hist_xxxx, sg_parent_names)single_parent[["session"]]<- as.double(single_parent[["session"]])
single_parent[["sing_par"]]<- as.double(single_parent[["sing_par"]])department_fr and remove the
department outside of mainland France.single_parent$department_fr <- stri_trans_general(single_parent$department, "Latin-ASCII") %>%
str_to_title(.) %>%
gsub("Du", "du", .) %>%
gsub("De", "de", .) %>%
gsub("D'", "D", .) %>%
gsub("Et", "et", .) %>%
gsub(" ", "-", .) %>%
str_replace_all("Corse-du-Sud", "Corse du Sud") %>%
str_replace_all("deux-Sevres", "Deux-Sevres") %>%
str_replace_all("Alpes-de-Hte-Provence", "Alpes-de-Haute-Provence") %>%
str_replace_all("Territoire-de-Belfort", "Territoire de Belfort") %>%
str_replace_all("Seine-Saint-denis", "Seine-Saint-Denis")
single_parent <- single_parent %>%
dplyr::filter(!department_fr %in% c("Polynesie-Francaise","Guyane", "Martinique", "Guadeloupe", "La-Reunion", "Mayotte", "Saint-Martin", "-")) %>%
dplyr::filter(!department_fr == "NA")We can see the final table single_parent below.
covide_names <- c("department_code", "test_week", "educational_level", "age_group", "pop", "positive", "tested", "incidence_rate", "positivity_rate", "screening_rate")
covid_in_schools <- rename_df(covid_sp_dep_heb_cage_scol_2022_11_30_19h01,covide_names)covid_in_schools[["positive"]] <- as.double(gsub(",",".", covid_in_schools[["positive"]]))
covid_in_schools[["incidence_rate"]] <- as.double(gsub(",",".", covid_in_schools[["incidence_rate"]]))
covid_in_schools[["positivity_rate"]] <- as.double(gsub(",",".", covid_in_schools[["positivity_rate"]]))test_date categorizing each week. We chose the first day of
the week. As we had the week number we had to select the week number and
then for each year add seven days to the Monday of the first week of the
year. The second variable is the session. A session is
categorised from August to July of the next year. As our argument will
be set on the month, we might have some test done the first days of
august count towards the “wrong” session. The number of Covid cases in
August are relatively low compared to the rest of the year and it
represents at maximum 6 days of tests. Therefore we consider this margin
of error to be satisfactory.covid_in_schools <- covid_in_schools %>%
mutate(test_date = case_when (as.numeric(substr(test_week, 1,4))== 2020
~ lubridate::ymd('2019-12-30') + lubridate::weeks(as.numeric(substr(test_week, 7,8))),
as.numeric(substr(test_week, 1,4))== 2021
~ lubridate::ymd('2021-01-04') + lubridate::weeks(as.numeric(substr(test_week, 7,8))),
as.numeric(substr(test_week, 1,4))== 2022
~ lubridate::ymd('2022-01-03') + lubridate::weeks(as.numeric(substr(test_week, 7,8)))),
session = case_when(month(test_date) <= 7 ~ year(test_date),
month(test_date) > 7 ~ year(test_date)+1))
department_fr and region. We use the
department_fr and region variables from
dnb_results. We join the two datasets by department_code.
To do this, we first need to match the two by removing the first
character of the department_code from dnb_results.reg_department <- dnb_results %>%
select(c("department_code", "department_fr", "region")) %>%
unique()
reg_department$department_code <- substring(reg_department$department_code, 2)
covid_in_schools <- right_join(x = covid_in_schools, y = reg_department, by = "department_code")covid_in_schools <- covid_in_schools %>%
filter(educational_level == "[11-15)")We can see the final table covid_in_schools below.
We will use the ggplot France map for our visualizations
map <- map_data("france")The region variable is in fact the departments. We rename it “department_fr” to fit in with the other data sets.
colnames(map)[5]<- "department_fr"To explore this data set we have decided to start on a national level to analyse the global tendency. We will then go down a level to a regional analysis to compare the number of students and see which region performs better. An analysis at the regional level will then be performed to dig deeper into the success rate and the graduation rate for each mention. To be complete with our analysis, we will see the results by establishment for the best and worst performing establishments in 2020. We will use their results of 2006 in comparison.
For this first visualisation we wanted to see the evolution of the
DNB results over time in order to have an overview of the data we were
working with. For this, we created a data set summarizing the results at
the national level. We simplified our visualisation process by pivoting
the table. During the process, we created four new variables
Candidates, Number_of_students,
Mention_type and Rate.
France_results <- dnb_results %>%
group_by(session) %>%
summarise(registered = sum(registered),
present = sum(present),
admitted = sum(admitted),
admitted_without = sum(admitted_without),
admitted_AB = sum(admitted_AB),
admitted_B = sum(admitted_B),
admitted_TB = sum(admitted_TB),
without_pct = mean(without_pct, na.rm = TRUE),
AB_pct = mean(AB_pct, na.rm = TRUE),
B_pct = mean(B_pct, na.rm = TRUE),
TB_pct = mean(TB_pct, na.rm = TRUE),
success_rate_pct = mean(success_rate_pct, na.rm = TRUE)) %>%
pivot_longer(c(registered, present,contains("admitted")),
names_to = "Candidates",
values_to = "Number_of_students") %>%
pivot_longer(c(contains("pct")),
names_to = "Mention_type",
values_to = "Rate")For this first graph, we plotted the Number_of_students
and grouped them by Candidates.
p <- France_results %>%
ggplot(aes(x = session, y = Number_of_students, group = Candidates, color = Candidates))+
geom_line()+
scale_color_viridis(discrete = TRUE) +
ggtitle("National DNB statistics") +
theme_ipsum() +
ylab("Number of students")
ggplotly(p, tooltip = c("x" ,"y"))According to the graph, we see that the number of admissions is increasing over the years. In fact, there is 704’742 admitted students in 2021 compared to only 572’236 in 2006. Though, we notice a slight decrease in 2018 and 2019. The positive trend resumed after that until 2021 and reached its peak in 2020 with 716’237 admissions. Similarly, the registration number follows the same tendency. In fact, 729’803 students had registered in 2006 compared to 801’721 students in 2021, for an increase of 9.85%, which is the highest record that has been stated so far. Also, the presence number broke the record in 2021 as it reached 795’209 students.
For the the analysis of the success rate and the rate of achievement
of honours. We plotted the Rate and grouped them by
Mention_type.
p <- France_results %>%
ggplot(aes(x = session, y = Rate, group = Mention_type, color = Mention_type))+
geom_line()+
scale_color_viridis(discrete = TRUE) +
ggtitle("National DNB statistics") +
theme_ipsum() +
ylab("Rate in %")
ggplotly(p, tooltip = c("x" ,"y"))The number of students that were admitted without a mention has been decreasing since 2006. It reached 284’774 students (53.85%) in 2006, compared to 154’095 students (25.11%) in 2021. The negative tendency calmed itself during the 2017-2021 period. As it is the case for all mentions, there is a complete shift in 2017. This will be analysed in further detail in the part one of our analysis. Contrary to the students graduating without honours, the number of students admitted with the TB distinction is showing a positive pattern. It grew from 31’563 (4.49%) in 2006 to reach 204’523 (23.98%) in 2021. The pattern for the two other distinctions are relatively similar as they remained stable up until 2017 but the change was not as drastic as for the first two distinction mentioned. The number of students that are admitted with a good mention show a positive pattern. In fact, the number has developed from 91’719 students (13.66%) in 2006 to 180’774 students (25.07%) in 2021. The number of students admitted with an AB distinction is increased slightly from 2006 to 2021. It developed from 164’180 (28%) students in 2006 to 165’350 students (25.83%) in 2021, representing a growth rate of 0.71%.
We started with a comparative analysis in order to compare the success rate between the regions. For this, we needed to create the charts below based on the dnb-results data set.
We created this bar plot in order to visualize the average success
rate of each of the French regions. We used the function group_by in
order to group our success rate data by region: We took the
success_rate and the region, we computed the
average success rate for each region and, we plotted them in the graph
below.
p <- dnb_results %>%
select(success_rate_pct, region, department) %>%
group_by(region) %>%
summarise(success_rate = mean(success_rate_pct, na.rm = TRUE)) %>%
ggplot(aes(x = region,
y = success_rate,
fill = region)) +
geom_col() +
scale_fill_viridis(discrete = TRUE) +
labs(title = "Average success rate by region, 2006-2021", y = "Rate in %")+
theme_ipsum() +
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank())
ggplotly(p, tooltip = c("x" ,"y"))On average, French regions have performed similarly over the years. However, the region of Bretagne took the lead, achieving a success rate of 89.8%. The region Nouvelle-Aquitaine followed with a success rate of 86.6%. On the third position, Auvergne-Rhône-Alpes recorded a success rate of 86.4%.
To have a clearer idea about the regions that performed the best, we created the following bar plot. It is based on summarized data to help us visualize the regions that recorded the highest number of admissions so far. Students’ distinction has once again been taken into consideration. Hence, we could have an idea about the regions that recorded the highest number of admissions for each of the distinctions until 2021.
p <- dnb_results %>%
select(admitted, region) %>%
group_by(region) %>%
summarise(admitted = sum(admitted)) %>%
ggplot(aes(x = region,
y = admitted,
fill = region)) +
geom_col() +
scale_fill_viridis(discrete = TRUE) +
theme_ipsum() +
ggtitle("Number of admitted by region") +
ylab("Number of students")+
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank())
ggplotly(p, tooltip = c("x" ,"y"))
p <- dnb_results %>%
select(admitted_without, region) %>%
group_by(region) %>%
summarise(zero = sum(admitted_without)) %>%
ggplot(aes(x = region,
y = zero,
fill = region)) +
geom_col() +
scale_fill_viridis(discrete = TRUE) +
theme_ipsum() +
ggtitle("Admitted with zero mention") +
ylab("Number of students")+
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank())
ggplotly(p, tooltip = c("x" ,"y"))
p <- dnb_results %>%
select(admitted_AB, region) %>%
group_by(region) %>%
summarise(AB = sum(admitted_AB)) %>%
ggplot(aes(x = region,
y = AB,
fill = region)) +
geom_col() +
scale_fill_viridis(discrete = TRUE) +
theme_ipsum() +
ggtitle("Admitted with mention AB") +
ylab("Number of students")+
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank())
ggplotly(p, tooltip = c("x" ,"y"))
p <- dnb_results %>%
select(admitted_B, region) %>%
group_by(region) %>%
summarise(B = sum(admitted_B)) %>%
ggplot(aes(x = region,
y = B,
fill = region)) +
geom_col() +
scale_fill_viridis(discrete = TRUE) +
theme_ipsum() +
ggtitle("Admitted with mention B") +
ylab("Number of students")+
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank())
ggplotly(p, tooltip = c("x" ,"y"))
p <- dnb_results %>%
select(admitted_TB, region) %>%
group_by(region) %>%
summarise(TB = sum(admitted_TB)) %>%
ggplot(aes(x = region,
y = TB,
fill = region)) +
geom_col() +
scale_fill_viridis(discrete = TRUE) +
theme_ipsum() +
ggtitle("Admitted with mention TB") +
ylab("Number of students")+
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank())
ggplotly(p, tooltip = c("x" ,"y"))Based on the chart, Ile de France has recorded the highest number of admissions since 2006: 1’914’458 students got admitted in this region so far. The second position was assigned to Auvergne Rhône-Alpes, with a total number of admissions equal to 1’0276’655 students. On the third position, the region Hauts-de-France registered 99’0610 admissions since 2006. Also, Grand Est, Nouvelle-Aquitaine and Occitanic have performed similarly over the years, recording around 85’000 admissions over the period [2006-2021]. On the contrary, Corse has recorded the least number of admissions so far: Only 41’541 students were admitted in this region over the same time interval.
Taking the distinction into consideration, we notice that the regions with the highest number of admissions are the ones with the highest number of all the distinctions. As a matter of fact, Ile de France continues to lead the number of highest honour (TB), high honours (B), honours (AB) and standard pass (without) admissions.
The results are not surprising. Indeed, according to statistics, Ile de France is the most populated region, with an approximative number of 12’262’544 inhabitants. Thus, we expected it to take the lead in terms of the number of admissions.
In order to take the time component into account, a linear chart has
been created: We used the group_by function in order to group
success_rate by session and
region. We relied on this chart to analyze the evolution of
the success rate throughout the years for each of the regions. Also, it
enabled us to take the admission’s distinction into consideration and
thus, analyze the development of students’ performance in all French
regions.
It is to be noted that for a better visual interpretation the scale of the y-axis is adapted for each graph.
p <- dnb_results %>%
select(success_rate_pct, region, session) %>%
group_by(region, session) %>%
summarise(success_rate = mean(success_rate_pct, na.rm = TRUE)) %>%
ggplot(aes(x = session,
y = success_rate,
color = region,
text = region)) +
geom_line() +
scale_color_viridis(discrete = TRUE) +
theme_ipsum() +
ggtitle("Success rate") +
ylab("Rate in %")
ggplotly(p, tooltip = c("text","x" ,"y" ))It is to be noted that for a better visual interpretation the scale of the y-axis is adapted for each graph.
p <- dnb_results %>%
select(without_pct, region, session) %>%
group_by(region, session) %>%
summarise(zero = mean(without_pct, na.rm = TRUE)) %>%
ggplot(aes(x = session,
y = zero,
color = region,
text = region)) +
geom_line() +
scale_color_viridis(discrete = TRUE) +
theme_ipsum() +
ggtitle("Admitted with zero mention") +
ylab("Rate in %")
ggplotly(p, tooltip = c("text","x" ,"y" ))It is to be noted that for a better visual interpretation the scale of the y-axis is adapted for each graph.
p <- dnb_results %>%
select(AB_pct, region, session) %>%
group_by(region, session) %>%
summarise(AB = mean(AB_pct, na.rm = TRUE)) %>%
ggplot(aes(x = session,
y = AB,
color = region,
text = region)) +
geom_line() +
scale_color_viridis(discrete = TRUE) +
theme_ipsum() +
ggtitle("Admitted with mention AB") +
ylab("Rate in %")
ggplotly(p, tooltip = c("text","x" ,"y" ))It is to be noted that for a better visual interpretation the scale of the y-axis is adapted for each graph.
p <- dnb_results %>%
select(B_pct, region, session) %>%
group_by(region, session) %>%
summarise(B = mean(B_pct, na.rm = TRUE)) %>%
ggplot(aes(x = session,
y = B,
color = region,
text = region)) +
geom_line() +
scale_color_viridis(discrete = TRUE) +
theme_ipsum() +
ggtitle("Admitted with mention B") +
ylab("Rate in %")
ggplotly(p, tooltip = c("text","x" ,"y" ))It is to be noted that for a better visual interpretation the scale of the y-axis is adapted for each graph.
p <- dnb_results %>%
select(TB_pct, region, session) %>%
group_by(region, session) %>%
summarise(TB = mean(TB_pct, na.rm = TRUE)) %>%
ggplot(aes(x = session,
y = TB,
color = region,
text = region)) +
geom_line() +
scale_color_viridis(discrete = TRUE) +
theme_ipsum() +
ggtitle("Admitted with mention TB") +
ylab("Rate in %")
ggplotly(p, tooltip = c("text","x" ,"y" ))The standard pass admissions’ rate follows a negative pattern: It has been decreasing since 2006 for all the regions. It dropped from 57.7 % in 2006 to 24.9% in 2021 for Ile de France, from 49.6% to 24.4% for Nouvelle-Aquitaine and from 53.4 % to 22.6% for Auvergne-Rhône-Alpes.
In 2017 and 2020, the standard pass admission’s rate as well as the rate of admissions with honours dropped remarkably, decreasing suddenly in 2020 for all the regions. However, this tendency was disrupted in 2019, in which this number increased noticeably. In 2021, we noticed a reproduction of 2017’s positive pattern translating thus, a deterioration of students’ performance once again.
Conversely, the number of students succeeding with a great distinction (highest and high honours) has been increasing in all French regions since 2006, reaching its highest point in 2020. Similarly, we notice a strong negative evolution in 2019 that resumed in 2021.
Following our national and regional analysis, we wanted to reduce the scope of our analysis further in order to have a better understanding of our data.
The first step of this analysis is to visualize the performance of
all French departments over the period [2006-2021]. In order to do that,
the following box plot has been created. For this, we needed to group
our data by department and session. Then, we
plotted the success rate of all French departments over the period
[2006-2021].
dnb_pct_dep <- dnb_results %>%
group_by(department, session) %>%
summarise(AB_pct_dep = mean(AB_pct, na.rm = TRUE),
B_pct_dep = mean(B_pct, na.rm = TRUE),
TB_pct_dep = mean(TB_pct, na.rm = TRUE),
without_pct_dep = mean(without_pct, na.rm = TRUE),
success_rate_pct_dep = mean(success_rate_pct, na.rm = TRUE))It is to be noted that for a better visual interpretation the scale of the y-axis is adapted for each graph.
p <- dnb_pct_dep %>%
ggplot(aes(x = session,
y = success_rate_pct_dep,
group = session,
fill = session,
text = department,
text2 = success_rate_pct_dep)) +
geom_boxplot()+
geom_jitter(width = 0.25, alpha = 0.5)+
scale_fill_gradientn(colors = viridis(16))+
guides(fill = "none")+
labs( x= "", y = "Success rate in %",
title ="Success rate of each Department by session")
ggplotly(p, tooltip = c("text","text2"))It is to be noted that for a better visual interpretation the scale of the y-axis is adapted for each graph.
p <- dnb_pct_dep %>%
ggplot(aes(x = session,
y = without_pct_dep,
group = session,
fill = session,
text = department,
text2 = without_pct_dep)) +
geom_boxplot()+
geom_jitter(width = 0.25, alpha = 0.5)+
scale_fill_gradientn(colors = viridis(16))+
guides(fill = "none")+
labs( x= "",
y = "Rate of students with no mention in %",
title ="Rate of students with no mention of each Department by session")
ggplotly(p, tooltip = c("text","text2"))It is to be noted that for a better visual interpretation the scale of the y-axis is adapted for each graph.
p <- dnb_pct_dep %>%
ggplot(aes(x = session,
y = B_pct_dep,
group = session,
fill = session,
text = department,
text2 = B_pct_dep)) +
geom_boxplot()+
geom_jitter(width = 0.25, alpha = 0.5)+
scale_fill_gradientn(colors = viridis(16))+
guides(fill = "none")+
labs( x= "",
y = "Rate of students with mention Bien in %",
title ="Rate of students with mention Bien of each Department by session")
ggplotly(p, tooltip = c("text","text2"))It is to be noted that for a better visual interpretation the scale of the y-axis is adapted for each graph.
p <- dnb_pct_dep %>%
ggplot(aes(x = session,
y = AB_pct_dep,
group = session,
fill = session,
text = department,
text2 = AB_pct_dep)) +
geom_boxplot()+
geom_jitter(width = 0.25, alpha = 0.5)+
scale_fill_gradientn(colors = viridis(16))+
guides(fill = "none")+
labs( x= "",
y = "Rate of students with mention Assez Bien in %",
title ="Rate of students with mention Assez Bien of each Department by session")
ggplotly(p, tooltip = c("text","text2"))It is to be noted that for a better visual interpretation the scale of the y-axis is adapted for each graph.
p <- dnb_pct_dep %>%
ggplot(aes(x = session,
y = TB_pct_dep,
group = session,
fill = session,
text = department,
text2 = TB_pct_dep)) +
geom_boxplot()+
geom_jitter(width = 0.25, alpha = 0.5)+
scale_fill_gradientn(colors = viridis(16))+
guides(fill = "none")+
labs( x= "",
y = "Rate of students with mention Très Bien in %",
title ="Rate of students with mention Très Bien of each Department by session")
ggplotly(p, tooltip = c("text","text2"))Based on our data analysis, the success rate is following a positive pattern. It has been increasing since 2006. In the same way, we notice an improvement of the students’ success rate in 2017 as well as a remarkable decline in 2019. A boost in the students’ performance was recorded in 2021 in which, the success rate in all departments has reached its highest level since 2006.
Also, the number of students succeeding with honours, or a standard pass distinction has been decreasing since 2006 for all departments. It remarkably decreased in 2017 compared to the previous years, reaching its lowest level in 2020. However, the negative pattern resumed in 2021, in which the number of students graduating with honours or with a standard pass distinction started to increase.
In the same context, the number of students graduating with highest or high honours has been increasing since 2006 for all French departments. It peculiarly heightened in 2017 to reach its highest point in 2020, which reflects a noticeable improvement in the students’ performance.
Moreover, we can see that the distribution remains stable over time for all ditinctions apart for the distinction TB. Indeed, up until 2017, the variance was very small with Q1 and Q3 being distant of around 2 percentage points. From 2017 onwards, it increased to nearly 5 percentage points.
The following box plot was created in order to compare the performance of the best and “worst” department. It allowed us to visualize the difference between the highest achieving department, Paris in 2020 and one of the least achieving one, Eure et Loir. We chose to display the results of each establishment to see the dispersion in the results within one department.
We plotted the different success rates of the Paris department for the year 2006 and 2020 using the filter function. We proceeded the same way to plot the success rates of the Eure et Loir region over the same period.
Paris <- dnb_results %>%
select(school_id,establishment_name,department,session, contains("pct")) %>%
filter(department == "PARIS", session == "2020") %>%
pivot_longer(c(contains("pct")), # pivot longer to allow for a clean and easy boxplot graph with each pct
names_to = "Mention_type",
values_to = "Rate")
Paris$Mention_type <- factor(Paris$Mention_type, levels = c("success_rate_pct","without_pct", "AB_pct", "B_pct", "TB_pct")) #creation of factor to order the graph
p <- Paris %>%
ggplot(aes(x = Mention_type,
y = Rate,
fill = Mention_type,
text = establishment_name,
text2 = Rate)) +
geom_boxplot()+
geom_jitter(width = 0.25, alpha = 0.5)+
guides(fill = "none")+
scale_fill_viridis(discrete = TRUE) +
theme_ipsum() +
labs( x= "",
y = "Rate in %",
title ="Results for Parisian establishments in 2020")
ggplotly(p, tooltip = c("text","text2"))
Paris <- dnb_results %>%
select(school_id,establishment_name,department,session, contains("pct")) %>%
filter(department == "PARIS", session == "2006") %>%
pivot_longer(c(contains("pct")), # pivot longer to allow for a clean and easy boxplot graph with each pct
names_to = "Mention_type",
values_to = "Rate")
Paris$Mention_type <- factor(Paris$Mention_type, levels = c("success_rate_pct","without_pct", "AB_pct", "B_pct", "TB_pct")) #creation of factor to order the graph
p <- Paris %>%
ggplot(aes(x = Mention_type,
y = Rate,
fill = Mention_type,
text = establishment_name,
text2 = Rate)) +
geom_boxplot()+
geom_jitter(width = 0.25, alpha = 0.5)+
guides(fill = "none")+
scale_fill_viridis(discrete = TRUE) +
theme_ipsum() +
labs( x= "",
y = "Rate in %",
title ="Results for Parisian establishments in 2006")
ggplotly(p, tooltip = c("text","text2"))
Eure <- dnb_results %>%
select(school_id,establishment_name,department,session, contains("pct")) %>%
dplyr::filter(department == "EURE-ET-LOIR", session == "2020") %>%
pivot_longer(c(contains("pct")), # pivot longer to allow for a clean and easy boxplot graph with each pct
names_to = "Mention_type",
values_to = "Rate")
Eure$Mention_type <- factor(Eure$Mention_type, levels = c("success_rate_pct","without_pct", "AB_pct", "B_pct", "TB_pct")) #creation of factor to order the graph
p <- Eure %>%
ggplot(aes(x = Mention_type,
y = Rate,
fill = Mention_type,
text = establishment_name,
text2 = Rate)) +
geom_boxplot()+
geom_jitter(width = 0.25, alpha = 0.5)+
guides(fill = "none")+
scale_fill_viridis(discrete = TRUE) +
theme_ipsum() +
labs( x= "",
y = "Rate in %",
title ="Results for Eure et Loir establishments in 2020")
ggplotly(p, tooltip = c("text","text2"))
Eure <- dnb_results %>%
select(school_id,establishment_name,department,session, contains("pct")) %>%
dplyr::filter(department == "EURE-ET-LOIR", session == "2006") %>%
pivot_longer(c(contains("pct")), # pivot longer to allow for a clean and easy boxplot graph with each pct
names_to = "Mention_type",
values_to = "Rate")
Eure$Mention_type <- factor(Eure$Mention_type, levels = c("success_rate_pct","without_pct", "AB_pct", "B_pct", "TB_pct")) #creation of factor to order the graph
p <- Eure %>%
ggplot(aes(x = Mention_type,
y = Rate,
fill = Mention_type,
text = establishment_name,
text2 = Rate)) +
geom_boxplot()+
geom_jitter(width = 0.25, alpha = 0.5)+
guides(fill = "none")+
scale_fill_viridis(discrete = TRUE) +
theme_ipsum() +
labs( x= "",
y = "Rate in %",
title ="Results for Eure et Loir establishments in 2006")
ggplotly(p, tooltip = c("text","text2"))Paris
On average, the success rate in all Paris establishments reached 93% in 2020. Representing thus, the highest performance recorded since 2006. Accordingly, the number of highest honour admissions outpaced the number of admissions with other distinctions.
Conversely, the success rate was much lower in 2006: It was around [ 40%- 60%], compared with [66%-100%] in 2020. In the same year, the number of students succeeding with a standard pass distinction was the highest, while the number of admissions with highest honour represented the smallest percentage.
Eure et Loire
In 2020, the results of the department of Eure et Loire followed the same pattern as those of the Paris department. It reached its peak in 2020, with an average success rate of 87.90%, compared to only 81% in 2006. Similarly, the number of admissions with highest honour outpaced the number of successes with other distinctions, in 2020.
However, in 2006, the number of students graduating with a standard pass distinction outstripped the one of those graduating with all the other distinctions.
Overall, in 2020, the Paris department outperformed the department of Eure et Loire, with an average success rate of 92.40% compared to only 87.90% for Eure et Loire. In the same context, the number of admissions with highest honour in the department of Paris outstripped largely the one of Eure et Loire. It was around [29.17% -55.21%] in Paris and around [14.52%-33.53%], in the department of Eure et Loire. Also, the number of standard pass admissions was remarkably higher in Eure et Loire. It represented 24.44% of the total number of admissions in 2020, compared to only 12.97% in Paris. Similarly, the department recorded an average rate of admissions with the honours distinction equal to 22.86%, compared to only 17.59% in Paris. We see that the dispersion pattern of the establishment is rather similar between the departments but changes between 2006 and 2020.
After analyzing the data provided by the dnb-result data set and understanding the evolution of French secondary students’ performance during the years [2006-2021], as well as the most significant events that marked this period. It is time to explore the Establishment_24 data set.
We started our analysis by mapping the establishments labelled Generation 2024 across France using the ggplot’s functions, geo_polygon combined with geom_point. This helped us visualize the geographic distribution of the labelled establishments and understand where they are mostly located.
p <- ggplot() +
geom_polygon(data = map, aes(long,lat, group = group), fill = "white", color = "grey") +
geom_point(data = establishment_24, aes(x = longitude , y = latitude, text = department_fr, text2 = establishment), size = 0.5)+
coord_map() +
labs(x = "",
y = "",
title = 'Establishment labelled "Génération 2024') +
map_theme
ggplotly(p, tooltip = c("text", "text2"))Based on the map, we conclude that the establishments labelled generation 2024 are mostly located in the region of Île-de-France, Auvergne-Rhône-Alpes, Occitanie and Nouvelle Aquitaine.However, we can hardly conclude anything with such graph and data. We decided to use the summary of labelled establishment by department for the rest of this exploration.
After having a general idea about the geographic distribution of generation 2024’s establishments, we created this second map to analyze their evolution in the period [2017-2022], for each department. For this, we used the function filter and ggplot to map for each year, starting from 2017, the number of new labelled establishment, and this for all French departments.
establishment_24_dep_map <- left_join(x = map[,-6], y = establishment_24_dep, by = "department_fr")
p <- establishment_24_dep_map %>%
filter(session_started == 2017) %>%
ggplot(aes(x = long,
y = lat,
group = group,
text = department_fr)) +
geom_polygon(data = map, aes(long,lat, group = group), fill = "white", color = "grey") + # here to add the outline of all departments
geom_polygon(aes(fill= establishment), color = "black") +
scale_fill_viridis(name = "Number of establishment", limits = c(0, 50)) +
labs(x = "",
y = "",
title = " New establishment labelled Generation 2024 in 2017")+
map_theme
ggplotly(p, tooltip = c("text","fill"))
p <-establishment_24_dep_map %>%
filter(session_started == 2018) %>%
ggplot(aes(x = long,
y = lat,
group = group,
text = department_fr)) +
geom_polygon(data = map, aes(long,lat, group = group), fill = "white", color = "grey") + # here to add the outline of all departments
geom_polygon(aes(fill= establishment), color = "black") +
scale_fill_viridis(name = "Number of establishment", limits = c(0, 50)) +
labs(x = "",
y = "",
title = " New establishment labelled Generation 2024 in 2018")+
map_theme
ggplotly(p, tooltip = c("text","fill"))
p <-establishment_24_dep_map %>%
filter(session_started == 2019) %>%
ggplot(aes(x = long,
y = lat,
group = group,
text = department_fr)) +
geom_polygon(data = map, aes(long,lat, group = group), fill = "white", color = "grey") + # here to add the outline of all departments
geom_polygon(aes(fill= establishment), color = "black") +
scale_fill_viridis(name = "Number of establishment", limits = c(0, 50)) +
labs(x = "",
y = "",
title = "New establishment labelled Generation 2024 in 2019")+
map_theme
ggplotly(p, tooltip = c("text","fill"))
p <-establishment_24_dep_map %>%
filter(session_started == 2020) %>%
ggplot(aes(x = long,
y = lat,
group = group,
text = department_fr)) +
geom_polygon(data = map, aes(long,lat, group = group), fill = "white", color = "grey") + # here to add the outline of all departments
geom_polygon(aes(fill= establishment), color = "black") +
scale_fill_viridis(name = "Number of establishment", limits = c(0, 50)) +
labs(x = "",
y = "",
title = "New establishment labelled Generation 2024 in 2020")+
map_theme
ggplotly(p, tooltip = c("text","fill"))
p <-establishment_24_dep_map %>%
filter(session_started == 2021) %>%
ggplot(aes(x = long,
y = lat,
group = group,
text = department_fr)) +
geom_polygon(data = map, aes(long,lat, group = group), fill = "white", color = "grey") + # here to add the outline of all departments
geom_polygon(aes(fill= establishment), color = "black") +
scale_fill_viridis(name = "Number of establishment", limits = c(0, 50)) +
labs(x = "",
y = "",
title = "New establishment labelled Generation 2024 in 2021")+
map_theme
ggplotly(p, tooltip = c("text","fill"))
p <-establishment_24_dep_map %>%
filter(session_started == 2022) %>%
ggplot(aes(x = long,
y = lat,
group = group,
text = department_fr)) +
geom_polygon(data = map, aes(long,lat, group = group), fill = "white", color = "grey") + # here to add the outline of all departments
geom_polygon(aes(fill= establishment), color = "black") +
scale_fill_viridis(name = "Number of establishment", limits = c(0, 50)) +
labs(x = "",
y = "",
title = "New establishment labelled Generation 2024 in 2022")+
map_theme
ggplotly(p, tooltip = c("text","fill"))At first glance, we notice the positive evolution of the number of labelled establishments. In 2017, approximatively, twenty labeled establishments were recorded within the French departments, with four registered by the Nord, being the highest number reported that year. This tendency continued until 2020, when almost all departments recorded new labeled establishments.
However, this pattern was disrupted in 2021, when only 15 departments recorded an evolution of the number of their labeled establishments. Surprisingly, 2022 has been the peak year so far: Only four departments did not declare an increase in this number.
At this point of our analysis, we wanted to analyze the percentage of labeled establishments across all French departments. To do that, we computed the rate of 2024 establishments for each department. We first counted the number of establishment by department using the establishments present in the data set dnb_results. From this, we just had to measure the rate of labelled establishment. Next, we used the ggplot function to map our data.
# We first need to count the number of establishment per department
nb_est <- dnb_results %>%
select(school_id, department_fr) %>%
group_by(department_fr) %>%
summarise(n_distinct(school_id))
# We include the values with the one from establishment_24
establishment_24_dnb_dep <- left_join(x = nb_est, y = establishment_24_dep)
# We measure the rate of labelled establishment
establishment_24_dnb_dep <- establishment_24_dnb_dep %>%
group_by(department_fr) %>%
summarise(establishment = sum(establishment),
nb_est = mean(`n_distinct(school_id)`)) %>% #use mean to keep the value of n_distinct(schoold_id)
mutate(rate = establishment/nb_est*100)
establishment_24_dnb_dep_map <- left_join(x = map[,-6], y = establishment_24_dnb_dep, by = "department_fr")
p <-establishment_24_dnb_dep_map %>%
ggplot(aes(x = long,
y = lat,
group = group,
text = department_fr)) +
geom_polygon(aes(fill= rate), color = "black") +
scale_fill_viridis(name = "Rate of labelled establishment in % ") +
labs(x = "",
y = "",
title = "Establishment 2024 in 2022")+
map_theme
ggplotly(p, tooltip = c("text","fill"))Based on the map, the department of Hautes-Pyrenees has the highest number of labeled establishments so far, with a rate of 52.78% in 2022. Haute-Corse took the second position, having 50% of its establishments labelled by the end of 2022. On the third position, 43.14% of the establishmentsof of Val-Doise were labeled by the end of 2022. Other departments have also shown interesting rates such as Essonne (41.43%), Corse du Sud (40%), Tarn-et-Garonne (40%).
It is interesting to see that even if the Olympic games will take place in Paris the establishments from the Ile de france region have relatively low labeling rates. Indeed apart from Val-Doise and Essone which have high labeling rates, the other departments are rather around 20%. The explanation could be related to the number of establishments in these departments. For example, Paris has 216 establishments whereas Val-Doise and Essone have respectively 153 and 140 establishments. Similarly the two Corsican departments and Hautes-Pyrenees only have 15, 16 and 36 establishments. To conclude it explains why some departments may have higher rates but based on the principle that establishments of a same department have a smimilar number of students and that we will compare this data to success rate by department it should not affect our end results.
After analyzing French secondary students’ success rate on a national, regional and a departmental level, and exploring the establishment_24 dataset, it is time to focus on the students housing dataset. Our goal is to analyze the evolution of the different housing modes between 2020 and 2022.
The following graph plots the number of students and groups them by their Housing option.
p <- student_housing %>%
select(session, external_students_secondary_education, half_boarders_students_secondary_education, boarding_students_secondary_education) %>%
group_by(session) %>%
summarise(external_students_secondary_education = sum(external_students_secondary_education, na.rm = TRUE),
half_boarders_students_secondary_education = sum(half_boarders_students_secondary_education, na.rm = TRUE),
boarding_students_secondary_education = sum(boarding_students_secondary_education, na.rm = TRUE)) %>%
gather("Housing_option", "students", 2:4) %>%
ggplot(aes(x=session,
y=students,
fill=Housing_option)) +
geom_bar(stat="identity", position=position_dodge())+
ggtitle("Students per housing option in France")+
xlab("Session")+
ylab("Number of students")+
scale_fill_viridis(discrete = TRUE)+
theme_ipsum()
ggplotly(p, tooltip = "y")Based on the chart, half boarding students represents the largest category, recording 3’654’401 students in 2020, compared to 1’536’776 external students, and only 191’100 boarding students in the same year. This category remains the undisputed leader in 2021 and 2022. The students’ distribution has not changed remarkably in the period [2020-2022]. However, we notice a slight decrease in the number of half-boarders ’students. French secondary students are less and less interested in the half-boarding housing mode. They were 3’654’401 students in 2020, compared to 3’593’806 in 2022, representing a negative variation of 1.65%. Conversely, the external housing mode is attracting more and more students. The number of externals showed a positive change of 4.94% in 2022 compared to 2020. It increased from 1’536’776 students in 2019 to 1’612’770 in 2022.
To conduct a departmental analysis, we created the following graph, in which we plotted for each of the housing modes, the rate of students recorded in each specific category for all French departments.
It is to be noted that for a better visual interpretation the scale of the y-axis is adapted for each graph.
p <- housing_dep %>%
ggplot(aes(x = as.factor(session),
y = external_students_rate,
group = session,
fill = session,
text = department_fr,
text2 = external_students_rate)) +
geom_boxplot()+
geom_jitter(width = 0.25, alpha = 0.5)+
scale_fill_viridis()+
guides(fill = "none")+
labs( x= "",
y = "Rate of external students in %",
title ="Rate of external students of each Department by session")+
theme_ipsum()
ggplotly(p, tooltip = c("text","text2"))It is to be noted that for a better visual interpretation the scale of the y-axis is adapted for each graph.
p <- housing_dep %>%
ggplot(aes(x = as.factor(session),
y = half_boarders_students_rate,
group = session,
fill = session,
text = department_fr,
text2 = half_boarders_students_rate)) +
geom_boxplot()+
geom_jitter(width = 0.25, alpha = 0.5)+
scale_fill_viridis()+
guides(fill = "none")+
labs( x= "",
y = "Rate of half-boarders students in %",
title ="Rate of half-boarders students of each Department by session")+
theme_ipsum()
ggplotly(p, tooltip = c("text","text2"))It is to be noted that for a better visual interpretation the scale of the y-axis is adapted for each graph.
p <- housing_dep %>%
ggplot(aes(x = as.factor(session),
y = boarding_students_rate,
group = session,
fill = session,
text = department_fr,
text2 = boarding_students_rate)) +
geom_boxplot()+
geom_jitter(width = 0.25, alpha = 0.5)+
scale_fill_viridis()+
guides(fill = "none")+
labs( x= "",
y = "Rate of boarding students in %",
title ="Rate of boarding students of each Department by session")+
theme_ipsum()
ggplotly(p, tooltip = c("text","text2"))Half boarding housing mode attracts the higher number of students in all French departments. On average, 69 % of students opted for this residence mode in 2020, 69.16% in 2021 and 68.88% in 2022. In 2020, 83.43% of Cotes_Darmor’ students were half boarders, which is the highest rate recorded in all French departments. Contrarily, only 49.7% of Seine-Saint-Denis’ students chose this housing mode, which is the lowest rate recorded that year. In 2021 and 2022, we notice the same distribution of half boarders’ students among French departments, with the highest rate registered by Cotes_Darmor and Vendee, and the lowest recorded with Seine-Saint-Denis. External housing mode is the second common one after the half boarding mode. In 2020, external students represented 24.71% of the total number of students, compared to 24.61% and 24.70% in 2021 and 2022, respectively. Landes, Cotes_Darmor and Vendee are the departments with the least number of external students, with a respective rate of 8.01 %, 8.42% and 8.49%. Seine-Saint-Denis occupies the first position, recording the highest number of externals: 50%. Similarly, in 2021 and 2022, those three departments recorded the lowest rates while the department of Seine-Saint-Denis recorded the highest. Finally, the boarding housing mode attracts the least number of students. It recorded the lowest rates over our study period. On average, boarding students represented 5.69% of the total number of students in 2020, 5.35 % in 2021 and 5.36 % in 2022. In 2020, Paris and Seine-Saint-Denis recorded the lowest rate of boarding students, 0.260% and 0.287 % respectively. Lozere in the other hand, recorded 25.79%, which is the highest rate of all the departments. We notice the same tendency in 2021 and 2022.
Our last data set related to socioeconomical factor is the single
parent one, The first visualisation we wanted to do was a barplot that
would show the evolution of the total number of single-parent families
in France from 2007 to 2008. To do this, we had to isolate, the
sing_par and session variables, then
summarized the number of single-parents with “sum”. We then were able to
use geom_bar.
sp1_ <- single_parent %>%
select(c("session", "sing_par")) %>%
group_by(session)%>%
summarise (sing_par = sum(sing_par, na.rm = TRUE))
p <- ggplot(data = sp1_, aes(x = as.factor(session),
y = sing_par,
fill = as.factor (session),
text = sing_par))+
geom_col(stat = "identity" )+
scale_fill_viridis(discrete = TRUE, "Session") +
labs(x = "", y = "Number of single parent families")+
theme_ipsum()
ggplotly(p, tooltip = "text")Based on the barplot, we can make an overall analysis of the evolution of single-parent families in France over the years. Although we are missing several years, we can clearly see that the number of single-parent families has continued to increase since 2007. In fact, single-parent families numbered 2’427’110 in 2007, whereas in 2018, they were numbering 3’031’823. In just over 10 years, there has been a 20% increase.
We needed to create a data set to create maps.For this, we joined the
single_parent and map data sets using department_fr.
jmap_sp<- left_join(x = map[,-6], y = single_parent, by = "department_fr")p <-jmap_sp %>%
filter(session == 2007) %>%
ggplot(aes(x = long,
y = lat,
group = group,
text = department_fr)) +
geom_polygon(aes(fill= sing_par), color = "black") +
scale_fill_viridis(name = "Number of single parent families") +
labs(x = "",
y = "",
title = "Single parent families in 2007")+
map_theme
ggplotly(p, tooltip = c("text","fill"))From the map, we realise that the departments with the most single-parent families are those with the most inhabitants (Bien dans ma ville, 2022). This is an unsurprising result which influence our choice to create variables for the data to become easily comparable.
To do this we first had to join single_parents and dnb_results_dep by
department_fr and session to have the number
of students per department. From this, we used the mutate function to
create a new variable that divides sing_par by the number
of students admitted by department. Thanks to this, we were able to
create maps of France showing a proportion of the number of
single-parent families per student admitted from 2007 to 2018. To create
the interactive maps with nice colours, we used the function
ggplotly.
sing_dnb <- left_join(x = single_parent, y = dnb_results_dep, by = c("department_fr", "session"))
sing_dnb <- sing_dnb %>%
mutate(single_parent_per_student_admitted = sing_par/admitted)
#Join with the map data set for the mapping.
sing_dnb_map <- left_join(x = map[,-6], y = sing_dnb, by = "department_fr")
p <-sing_dnb_map %>%
filter(session == 2007) %>%
ggplot(aes(x = long,
y = lat,
group = group,
text = department_fr)) +
geom_polygon(aes(fill= single_parent_per_student_admitted), color = "black") +
scale_fill_viridis(name = "Single parent families per student", limits = c(0, 8)) +
labs(x = "",
y = "",
title = "Single parent families per student in 2007")+
map_theme
ggplotly(p, tooltip = c("text","fill"))
p <-sing_dnb_map %>%
filter(session == 2008) %>%
ggplot(aes(x = long,
y = lat,
group = group,
text = department_fr)) +
geom_polygon(aes(fill= single_parent_per_student_admitted), color = "black") +
scale_fill_viridis(name = "Single parent families per student", limits = c(0, 8)) +
labs(x = "",
y = "",
title = "Single parent families per student in 2008")+
map_theme
ggplotly(p, tooltip = c("text","fill"))
p <-sing_dnb_map %>%
filter(session == 2012) %>%
ggplot(aes(x = long,
y = lat,
group = group,
text = department_fr)) +
geom_polygon(aes(fill= single_parent_per_student_admitted), color = "black") +
scale_fill_viridis(name = "Single parent families per student", limits = c(0, 8)) +
labs(x = "",
y = "",
title = "Single parent families per student in 2012")+
map_theme
ggplotly(p, tooltip = c("text","fill"))
p <-sing_dnb_map %>%
filter(session == 2013) %>%
ggplot(aes(x = long,
y = lat,
group = group,
text = department_fr)) +
geom_polygon(aes(fill= single_parent_per_student_admitted), color = "black") +
scale_fill_viridis(name = "Single parent families per student", limits = c(0, 8)) +
labs(x = "",
y = "",
title = "Single parent families per student in 2013")+
map_theme
ggplotly(p, tooltip = c("text","fill"))
p <-sing_dnb_map %>%
filter(session == 2017) %>%
ggplot(aes(x = long,
y = lat,
group = group,
text = department_fr)) +
geom_polygon(aes(fill= single_parent_per_student_admitted), color = "black") +
scale_fill_viridis(name = "Single parent families per student", limits = c(0, 8)) +
labs(x = "",
y = "",
title = "Single parent families per student in 2017")+
map_theme
ggplotly(p, tooltip = c("text","fill"))
p <-sing_dnb_map %>%
filter(session == 2018) %>%
ggplot(aes(x = long,
y = lat,
group = group,
text = department_fr)) +
geom_polygon(aes(fill= single_parent_per_student_admitted), color = "black") +
scale_fill_viridis(name = "Single parent families per student", limits = c(0, 8)) +
labs(x = "",
y = "",
title = "Single parent families per student in 2018")+
map_theme
ggplotly(p, tooltip = c("text","fill"))It can be noted that the departments in the East have the lowest single parent rate (Mayenne, Maine-et-Loire, Vendée), while the departments in the Iles-de-France region (Paris, Seine- Saint-Denis, Marne), those around the Mediterranean Sea (Bouche-du-Rhone, Pyrenees, Herault) and those in Corsica (Haute-Corse, Corse du Sud) have the highest rates. It will therefore be important to keep an eye on these departments when analyzing the results of Diploma Brevet National (DNB).
First, we wanted to get an overview of COVID positive cases in France
over the covid period. We selected from covid_in_school the variables
positive and test_date, group by
test_date and summarised positive with sum. We
then use ggplot with test_date as the x-axis and
positive for the y-axis.
p <- covid_in_schools %>%
select(positive, test_date) %>%
group_by(test_date) %>%
summarise(positive = sum(positive)) %>%
ggplot( mapping = aes(x= test_date, y = positive)) +
geom_line() +
labs(title = "French Covid-19 cases for the age group 11 to 15 years old (2020-2022)", x = "Date", y = "Number of cases")+
theme_ipsum()
ggplotly(p, tooltip = c("x","y"))From 2020 until the end of 2021, we notice waves of covid cases. Indeed, every 6 months, the number of positive cases increases to nearly 20’000 positive cases per day and then drops to around 1’000 positive cases per day. However, we observe a clear increase, with a peak of more than 235,000 positive cases per day, in January 2022. After this increase, the number of positive cases fell significantly to remain around 2’000-10’000 positive cases per day. The 2022 academic year is therefore a year to keep an eye on, for further researches as we do not yet have the dnb results of this session.
For the second exploratory analysis, we wanted to compare the
positive covid cases by regions. As population differs between
departemnts, it was easier to compare using incidence_rate.
To create the map, we selected from the covid_in_schools dataset the
variables department_code, incidence_rate,
session, region, department_fr
and test_date, did a group_by region and
test_date and summarised positive by doing an
average.
p <- covid_in_schools %>%
select(c("department_code", "incidence_rate", "session", "region", "department_fr", "test_date", "positive")) %>%
group_by(region, test_date) %>%
summarise(incidence_rate = mean(incidence_rate, na.rm = TRUE),
positive = sum(positive)) %>%
ggplot() +
geom_line(mapping = aes(x = test_date, y = incidence_rate, color = region))+
scale_color_viridis(discrete = TRUE) +
labs(title = "Covid-19 incidence rate for the age group 11 to 15 years old by region (2020-2022)", x = "Date", y = "Incidence rate") +
theme_ipsum()
ggplotly(p, tooltip = c("x","y", "color"))For the age group 11 to 15, Ile-de-France and Auvergne-Rhone-Alpes seem to be the two regions which were the most affected in the first waves. The Covid19 did not spare any department as we their incidence rate closely follow each others. The only slight exception is the wave in March-April 2022 where a few departments like Corse and Bretagne were more affected.
The last vizualization for the covid dataset is a mapping of incidence rate average by departments from session 2020 to 2023.
covidpos_dep <- covid_in_schools %>%
select(c("department_code", "incidence_rate", "session", "department_fr" )) %>%
group_by(department_fr, session) %>%
summarise(incidence_rate = mean(incidence_rate, na.rm = TRUE))
covidpos_dep <- left_join(x = map[,-6], y = covidpos_dep)It is to be noted that for a better visual interpretation the scale is adapted for each graph.
p <- covidpos_dep %>%
filter(session == 2020) %>%
ggplot( aes(x= long, y= lat, group=group, text = department_fr)) +
geom_polygon(aes(fill= incidence_rate), color = "black") +
coord_map()+
scale_fill_viridis(name = "Incidence rate average in 2020")+
map_theme
ggplotly(p)It is to be noted that for a better visual interpretation the scale is adapted for each graph.
p <- covidpos_dep %>%
filter(session == 2021) %>%
ggplot( aes(x= long, y= lat, group=group, text = department_fr)) +
geom_polygon(aes(fill= incidence_rate), color = "black") +
coord_map()+
scale_fill_viridis(name = "Incidence rate average in 2021")+
map_theme
ggplotly(p)It is to be noted that for a better visual interpretation the scale is adapted for each graph.
p <- covidpos_dep %>%
filter(session == 2022) %>%
ggplot( aes(x= long, y= lat, group=group, text = department_fr)) +
geom_polygon(aes(fill= incidence_rate), color = "black") +
coord_map()+
scale_fill_viridis(name = "Incidence rate average in 2022")+
map_theme
ggplotly(p)It is to be noted that for a better visual interpretation the scale is adapted for each graph.
p <- covidpos_dep %>%
filter(session == 2023) %>%
ggplot( aes(x= long, y= lat, group=group, text = department_fr)) +
geom_polygon(aes(fill= incidence_rate), color = "black") +
coord_map()+
scale_fill_viridis(name = "Incidence rate average in 2023")+
map_theme
ggplotly(p)During the 2020 session, the average incidence rate is relatively low in all departments of France, except in Mayenne where the incidence rate is 17.96, which is the highest rate. For the 2021 session, the scale indicates a clear increase in the proportions of average incidence rate in each of the departments, ranging from 246.9 to 52.6 cases per department. The western part of France is much more affected than the eastern part. We can hypothesise that this increase may be due to the fact that these are the departments closest to the borders and other countries. During the 2022 session, we can observe an overall increase in the incidence rate in each of the departments. Aveyron, Landes, Pyrénées-Atlantiques and Corse du Sud are the departments with the highest rates, with a maximum of 1201. For the 2023 session, the incidence rate fell significantly, particularly in Ile de France. This is because the school year is not over yet. La Creuse is the department with the highest rate with 420.
In the first part of our project, we focused on exploring our data. We created several charts; bar plots, box plots, maps etc.; to have a general comprehension of our data and retrieve the general tendencies.
This part of the project will be more focused on finding answers in order to draw conclusions about the evolution of the French secondary students’ performance and identifying the variables that have impacted it the most.
We have seen previsouly that French secondary students’ performance is showing a positive evolution. The number of admissions has been increasing since 2006, it is 23.15% higher in 2020 compared to 2006. Not only the overall success rate, but also the distinctions that students get have changed between 2006 and 2021. During this period, more and more students passed with a TB distinction (23.98% in 2021), compared to only 4.49 % in 2006. In the same way, the rate of students passing with no mention decreased remarkably. It was 53.85% in 2011, compared to only 25.11% in 2021.
p <- France_results %>%
ggplot(aes(x = session, y = Rate, group = Mention_type, color = Mention_type))+
geom_line()+
scale_color_viridis(discrete = TRUE) +
ggtitle("National DNB statistics") +
theme_ipsum() +
ylab("Rate in %")
ggplotly(p, tooltip = c("x" ,"y"))2017: A drastic improvement in students’ performance was recorded in 2017.
Students’ performance has drastically improved in 2017, recording a success rate of 88.74% for the first time since 2006. Also, the number of admissions with a TB distinction expanded during that year, reaching 25.12% compared to only 4.49% in 2006. In the same context, admissions without a mention reached their lowest point so far, with a rate of 23.40%.
Based on further literature research, we discovered that this major shift in the French secondary students’ performance has to do with the reform “learn better to succeed better”, that was brought into force at the beginning of the 2016 school year (accounted for session 2017 in our analysis) by the minister of education Najat Vallaud-Belkacem. This reform is articulated around two main points: redefining the common core of knowledge and skills as well as the content of the programs to make them more comprehensible, more progressive, and more coherent. The focus was mainly on these three areas: French, mathematics, and history. Moreover, teachers had to devote 20% of the class time to new pedagogical methods such as the practice of multidisciplinary teaching, providing personalized support, and engaging students in collaborative approaches.
2019: Sudden decrease in students’ performance
Students’ performance deteriorated sharply in 2019. The rate of admissions with a TB distinction dropped remarkably, passing from 25.12% in 2017, to 19.66% in 2019. Accordingly, the number of admissions without a mention increased, reaching 30.63% compared to 23.40% in 2017.
2020: students’ performance reached its peak
In 2020, students’ performance has reached its peak. 90.19% of French secondary school students succeeded that year, compared to only 86.14% in 2019. On one hand, the number of admissions with a TB mention improved drastically compared to 2019, with a rate of 28.45%. On the other hand, the admissions without a mention registered a rate of 21.07%, being the lowest rate recorded since 2006.
Based on our research, we assume that this unique performance has to do with the decision of exams cancellation that was taken by the education minister Jean-Michel Blanquer in 2020, due to the corona virus pandemic: France canceled traditional end of secondary school exams in favor of continuous assessment. So, instead of sit-down tests in the summer, students received an average score, for each subject, based on the grades scored in tests and homework throughout the academic year.
We first mapped the establishment labelled Generation 2024 onto a map of the average success rate for the brevet during the period 2006-20021. This visualisation helped us determine whether a clear link could be made or if further analysis need to be performed. To create the map, we first had to calculate the average success rate and join the result with the map data set.
result <- dnb_results %>%
select(department_fr, success_rate_pct) %>%
group_by(department_fr) %>%
summarise(success_rate = mean(success_rate_pct, na.rm = TRUE))
result_map <- left_join(x = map[,-6], y = result)Then, with the help of ggplot, we have superimposed the map of the average success rate and the dotted map of the localisation of the labelled establishment presented in chapter 4.2.
p <- ggplot() +
geom_polygon(data = result_map, aes(long,lat,
group = group,
fill = success_rate,
text = department_fr)) +
geom_point(data = establishment_24, aes(x = longitude, y = latitude, text2= establishment), size = 0.5)+
coord_map() +
scale_fill_viridis(name = "Average sucess rate, 2006-2021")+
labs(x = "",
y = "",
title = "Labelled Establishemnt vs average success rate") +
map_theme
ggplotly(p, tooltip = c("text2", "text") ) #due to a strange behavior of ggplotly we must add the "text" variable in the tooltip otherwise the success rate map is only partially displayed. Zooming on the map, one can see that there is no clear pattern of correlation between high achieving departments and great number of labelled establishment. Indeed, one can see that for example, there are many labelled establishment around Paris and the success rate varies greatly. Building from this, we do not expect to have a strong relation between the success rate and the labelling or not of an establishment.
To test this statement, we did a linear regression at the
establishment level. We used the Generation 2024 label as a boolean with
1 for labelled establishment and 0 for unlabelled ones. To create this
variable we first had to add the school_id and
session_started variables from establishment_24 to
dnb_results. We used left_join by the variable school_id to
keep all the establishments present in dnb_results. To add the variable
est_24, we used the mutate coupled with the case_when to
cover all cases. est_24 takes the value 1 when the
session is greater or equal than
session_started and the value 0 for all other cases which
are when session is smaller than
session_started or the establishment is not labelled and
the session_startedvalue is NA.
est_24_join <- establishment_24 %>%
select(school_id, session_started)
est_24_dnb <- left_join( x = dnb_results, y = est_24_join, by = c("school_id"))
est_24_dnb <- est_24_dnb %>%
filter(session >= 2017) %>%
mutate(est_24 = case_when(session >= session_started ~ 1,
session < session_started ~ 0,
is.na(session_started) == TRUE ~ 0 ))To have a visual representation of the possible relation between the
success rate and the labelling, we used geom_bin_2d with
est_24 as factor for the x-axis and
success_rate_pctfor the y-axis. We used this way of
plotting to display in a more visual way the dispersion of the
establishments on the y-axis. To do the linear regression we used the lm
function.
p <- ggplot(data = est_24_dnb )+
geom_bin_2d( aes( x = as.factor(est_24), y = success_rate_pct))+
labs(x = "est_24")+
theme_ipsum()+
scale_fill_viridis("Number of establishment")
ggplotly(p, tooltip = c("y", "fill"))lm_est_24_dnb <- lm(data = est_24_dnb, success_rate_pct ~ est_24)
tab_model(lm_est_24_dnb)| success_rate_pct | |||
|---|---|---|---|
| Predictors | Estimates | CI | p |
| (Intercept) | 88.09 | 88.00 – 88.19 | <0.001 |
| est 24 | -0.60 | -1.00 – -0.19 | 0.004 |
| Observations | 42472 | ||
| R2 / R2 adjusted | 0.000 / 0.000 | ||
The result of this linear regression shows that there could be a potential negative aspect to have the Generation 2024 label. The R squared being below 0.000 makes this analysis highly unreliable as it would explain less than 0.01% of the variation in success rate.
We decided to continue our analysis and proceeded to perform a linear regression with each distinction. To performed them we followed the same steps as for the success rate regression analysis.
p <- ggplot(data = est_24_dnb )+
geom_bin_2d( aes( x = as.factor(est_24), y = without_pct))+
labs(x = "est_24")+
theme_ipsum()+
scale_fill_viridis("Number of establishment")
ggplotly(p, tooltip = c("y", "fill"))lm_est_24_dnb <- lm(data = est_24_dnb, without_pct ~ est_24)
tab_model(lm_est_24_dnb)| without_pct | |||
|---|---|---|---|
| Predictors | Estimates | CI | p |
| (Intercept) | 25.44 | 25.31 – 25.57 | <0.001 |
| est 24 | -1.21 | -1.77 – -0.65 | <0.001 |
| Observations | 42458 | ||
| R2 / R2 adjusted | 0.000 / 0.000 | ||
p <- ggplot(data = est_24_dnb )+
geom_bin_2d( aes( x = as.factor(est_24), y = AB_pct))+
labs(x = "est_24")+
theme_ipsum()+
scale_fill_viridis("Number of establishment")
ggplotly(p, tooltip = c("y", "fill"))lm_est_24_dnb <- lm(data = est_24_dnb, AB_pct ~ est_24)
tab_model(lm_est_24_dnb)| AB_pct | |||
|---|---|---|---|
| Predictors | Estimates | CI | p |
| (Intercept) | 25.68 | 25.58 – 25.77 | <0.001 |
| est 24 | -2.30 | -2.70 – -1.89 | <0.001 |
| Observations | 42458 | ||
| R2 / R2 adjusted | 0.003 / 0.003 | ||
p <- ggplot(data = est_24_dnb )+
geom_bin_2d( aes( x = as.factor(est_24), y = B_pct))+
labs(x = "est_24")+
theme_ipsum()+
scale_fill_viridis("Number of establishment")
ggplotly(p, tooltip = c("y", "fill"))lm_est_24_dnb <- lm(data = est_24_dnb, B_pct ~ est_24)
tab_model(lm_est_24_dnb)| B_pct | |||
|---|---|---|---|
| Predictors | Estimates | CI | p |
| (Intercept) | 25.27 | 25.18 – 25.36 | <0.001 |
| est 24 | -0.43 | -0.81 – -0.05 | 0.027 |
| Observations | 42458 | ||
| R2 / R2 adjusted | 0.000 / 0.000 | ||
p <- ggplot(data = est_24_dnb )+
geom_bin_2d( aes( x = as.factor(est_24), y = TB_pct))+
labs(x = "est_24")+
theme_ipsum()+
scale_fill_viridis("Number of establishment")
ggplotly(p, tooltip = c("y", "fill"))lm_est_24_dnb <- lm(data = est_24_dnb, TB_pct ~ est_24)
tab_model(lm_est_24_dnb)| TB_pct | |||
|---|---|---|---|
| Predictors | Estimates | CI | p |
| (Intercept) | 23.61 | 23.47 – 23.75 | <0.001 |
| est 24 | 3.94 | 3.34 – 4.53 | <0.001 |
| Observations | 42458 | ||
| R2 / R2 adjusted | 0.004 / 0.004 | ||
The R squared is unsurprisingly very low again meaning that we can not conclude any strong relation between students achievements and the Generation 2024 label. It is still to be mentioned that it affects positively the TB attribution rate and negatively the rest.
We can conclude that from our analysis we do not see any pattern that the Generation 2024 label influences has any influence on the results of the establishments. It is not that surprising as the objectives of the label is truly to promote the olympics in Paris in 2024 and link the schoold world to the sport’s one. It has no requirements on a amount of sport done during the school time. The total amount of sport practised by students is therefore probably not much greater in a labelled institution than in a non-labelled one, hence, the little influence.
To analyse the impact of the housing offerings of the establishments
on the results of their student, we first need to join the
student_housing data set and the dnb_results one. We filtered out the
unnecessary or repetitive variables of dnb_results such as the academy
name and code or the education sector. We used inner_join by
session, school_id and
department_fr to keep only establishment appearing in
student_housing.
dnb_prep_housing <- dnb_results %>%
select(session, school_id, registered:TB_pct)
housing_dnb <- inner_join(x = student_housing , y = dnb_prep_housing, by = c("session", "school_id", "department_fr"))It is to be noted that the number of students between
students_secondary_education and registered
does not match as students of the four years of college are accounted
for in students_secondary_education.
We evaluated the possibility of doing a multiple linear regression by
measuring the correlation of the external_students_rate,
half_boarders_students_rate and
boarding_students_rate variables.
corrplot(cor(housing_dnb[c(17:19)]), col = viridis(256))
The correlation is unsurprisingly very high as if for example one
establishment offers a meal for dinner most of the students will take it
and very few will eat at home. Therefore, we will use single linear
regression to asses whether there exist a link between student results
and their habitual place of eating and living.
We simply used the lm function for each offering. We did not include graphs in the final report as they were completely overloaded.
lm_housing_dnb <- lm(data = housing_dnb, success_rate_pct ~ external_students_rate)
tab_model(lm_housing_dnb)| success_rate_pct | |||
|---|---|---|---|
| Predictors | Estimates | CI | p |
| (Intercept) | 91.92 | 91.72 – 92.12 | <0.001 |
| external students rate | -0.10 | -0.10 – -0.09 | <0.001 |
| Observations | 15803 | ||
| R2 / R2 adjusted | 0.070 / 0.070 | ||
lm_housing_dnb <- lm(data = housing_dnb, success_rate_pct ~ half_boarders_students_rate)
tab_model(lm_housing_dnb)| success_rate_pct | |||
|---|---|---|---|
| Predictors | Estimates | CI | p |
| (Intercept) | 82.04 | 81.64 – 82.43 | <0.001 |
|
half boarders students rate |
0.10 | 0.10 – 0.11 | <0.001 |
| Observations | 15803 | ||
| R2 / R2 adjusted | 0.084 / 0.084 | ||
lm_housing_dnb <- lm(data = housing_dnb, success_rate_pct ~ boarding_students_rate)
tab_model(lm_housing_dnb)| success_rate_pct | |||
|---|---|---|---|
| Predictors | Estimates | CI | p |
| (Intercept) | 89.43 | 89.28 – 89.57 | <0.001 |
| boarding students rate | -0.06 | -0.07 – -0.05 | <0.001 |
| Observations | 15803 | ||
| R2 / R2 adjusted | 0.005 / 0.005 | ||
The p-value is excellent for each regression but as observed with student housing, the R squared is very low. However, we see some intersting variations in the intercept, a difference of 10%, that could prove to be insightful. To dig deeper into these variations, we performed a cluster analysis.
We performed the cluster analysis on the session 2021 and 2020 and results are very similar. We decided to leave the two analysis but going through one is sufficient to see the method used and understand the results.
To perform the cluster analysis, we filtered the session 2021 and
selected the variables department_fr, session,
school_id, external_students_rate,
half_boarders_students_rate,
boarding_students_rate from the joined data set. We then
removed th first three and scaled the rest of the data.
The goal of this cluster analysis is to define clusters relative to the offering of each establishment. Through this analysis, we aim to split the establishments by offering. As we have 3 different offering it would not make sense to do too many clusters. To define the right number of cluster for our kmeans clustering, we used the elbow method and concluded that we need 4 clusters. The ratio between the between sum of square and the within sum of square is good at 81.5%.
fviz_nbclust(clust, kmeans, method = "wss") +
geom_vline(xintercept = 4, linetype = 2) + #add line for better visualisation
labs(subtitle = "Elbow method") #We can determine the optimal number of cluster, 4 clusters seems to be reasonable
# Compute k-means
km.res <- kmeans(clust, 4, nstart = 25)
# Visualize clusters using factoextra
fviz_cluster(km.res, clust,
ggtheme = theme_ipsum(),
repel = TRUE)+
scale_fill_viridis(discrete = TRUE)+
scale_color_viridis(discrete = TRUE)We can see that the four clusters are well defined with cluster one, two and four being differentiated by the x-axis and cluster 3 spanning wider on the x-axis and on the y-axis. One can see what the dimensions represent in the graph below. Cluster one and two have higher half boarder rate than group three and especially group four. Group four will represent establishment without any catering offerings for the students. Group three is establishments with at least some of the students in boarding schools.
fviz_pca_var(PCA(clust, graph = FALSE))To be able to continue our analyisis on the cluster we just measured, we need to implement the data in the main data set. We gathered the cluster results in a new data frame then added the column of cluster to the data set used for the cluster and then joined it with inner_join to the main data set to keep the filter we had applied.
clust1 <- tibble(department_fr = names(km.res$cluster),
cluster = km.res$cluster)
clust_housing_dnb$cluster <- clust1$cluster
housing_dnb_2021 <- inner_join(x = housing_dnb, y = clust_housing_dnb)To analyse the data by cluster we needed to summarise it by cluster.
You can see the results in the table below with an added variable
establishment which is the number of establishment present
in each cluster.
clust_h_dnb_2021 <- housing_dnb_2021 %>%
group_by(cluster) %>%
summarise(establishment = n(),
external_students_rate = mean(external_students_rate, na.rm = TRUE),
half_boarders_students_rate = mean(half_boarders_students_rate, na.rm = TRUE),
boarding_students_rate = mean(boarding_students_rate, na.rm = TRUE),
without_pct = mean(without_pct, na.rm = TRUE),
AB_pct = mean(AB_pct, na.rm = TRUE),
B_pct = mean(B_pct, na.rm = TRUE),
TB_pct = mean(TB_pct, na.rm = TRUE),
success_rate_pct = mean(success_rate_pct, na.rm = TRUE)) %>%
round(digits = 2)
datatable(clust_h_dnb_2021, options =list(scrollX = "300px"))We visually represented the results in a spider chart using the radarchart function.
op <- par(mar=c(0, 0, 0, 0))
radar <- clust_h_dnb_2021 %>%
select(-c(cluster, establishment))
# To use the fmsb package, I have to add 2 lines to the dataframe: the max and min of each variable to show on the plot!
radar <- rbind(rep(100,50) , rep(0,50) , radar)
# plot the radar chart with the right color from viridis
radarchart(radar, axistype=1 ,
#custom polygon
pcol= viridis(4, alpha = 1) , pfcol= viridis(4, alpha = 0.2) , plwd=1 , plty=1,
#custom the grid
cglcol="grey", cglty=1, axislabcol="grey", caxislabels=seq(0,100,25), cglwd=0.4, calcex = 0.4,
#custom labels
vlcex=0.4)#, title = "Radar Graph")
# Add a legend
legend(x=1.2, y=-0.4, legend = c("Cluster 1", "Cluster 2", "Cluster 3", "Cluster 4"), bty = "n", pch=20 , col=viridis(4) , text.col = "grey", cex=0.4, pt.cex=1)
par(op)For better clarity, we have also displayed each cluster individually.
# we need to set the margin and create two rows to display the graphs
op <- par(mar=c(0, 1, 1, 0),mfrow=c(2, 2))
##### Cluster 1
#filter cluster 1
radar <- clust_h_dnb_2021 %>%
filter(cluster == 1)%>%
select(-c(cluster, establishment))
# To use the fmsb package, I have to add 2 lines to the dataframe: the max and min of each variable to show on the plot!
radar <- rbind(rep(100,50) , rep(0,50) , radar)
# plot the radar chart with the right color from viridis
radarchart(radar, axistype=1 ,
#custom polygon
pcol= rgb( red = 0.267, green = 0.00392, blue = 0.329, alpha = 0.5) , pfcol= rgb( red = 0.267, green = 0.00392, blue = 0.329, alpha = 0.2) , plwd=1 , plty=1,
#custom the grid
cglcol="grey", cglty=1, axislabcol="grey", caxislabels=seq(0,100,25), cglwd=0.4, calcex = 0.4,
#custom labels
vlcex=0.4, title = "Cluster 1", cex.main = 1 )
##### Cluster 2
radar <- clust_h_dnb_2021 %>%
filter(cluster == 2) %>%
select(-c(cluster, establishment))
# To use the fmsb package, I have to add 2 lines to the dataframe: the max and min of each variable to show on the plot!
radar <- rbind(rep(100,50) , rep(0,50) , radar)
# plot the radar chart with the right color from viridis
radarchart(radar, axistype=1 ,
#custom polygon
pcol= rgb( red = 0.192 , green = 0.40784, blue = 0.557, alpha = 0.5) , pfcol= rgb( red = 0.192 , green = 0.40784, blue = 0.557, alpha = 0.5) , plwd=1 , plty=1,
#custom the grid
cglcol="grey", cglty=1, axislabcol="grey", caxislabels=seq(0,100,25), cglwd=0.4, calcex = 0.4,
#custom labels
vlcex=0.4, title = "Cluster 2", cex.main = 1 )
##### Cluster 3
radar <- clust_h_dnb_2021 %>%
filter(cluster == 3) %>%
select(-c(cluster, establishment))
# To use the fmsb package, I have to add 2 lines to the dataframe: the max and min of each variable to show on the plot!
radar <- rbind(rep(100,50) , rep(0,50) , radar)
# plot the radar chart with the right color from viridis
radarchart(radar, axistype=1 ,
#custom polygon
pcol= rgb( red = 0.208 , green = 0.71765, blue = 0.475, alpha = 0.5) , pfcol= rgb( red = 0.208 , green = 0.71765, blue = 0.475, alpha = 0.5) , plwd=1 , plty=1,
#custom the grid
cglcol="grey", cglty=1, axislabcol="grey", caxislabels=seq(0,100,25), cglwd=0.4, calcex = 0.4,
#custom labels
vlcex=0.4, title = "Cluster 3", cex.main = 1 )
##### Cluster 4
radar <- clust_h_dnb_2021 %>%
filter(cluster == 4) %>%
select(-c(cluster, establishment))
# To use the fmsb package, I have to add 2 lines to the dataframe: the max and min of each variable to show on the plot!
radar <- rbind(rep(100,50) , rep(0,50) , radar)
# plot the radar chart with the right color from viridis
radarchart(radar, axistype=1 ,
#custom polygon
pcol= rgb( red = 0.992 , green = 0.90588, blue = 0.145, alpha = 0.5), pfcol= rgb( red = 0.992 , green = 0.90588, blue = 0.145, alpha = 0.5), plwd=1 , plty=1,
#custom the grid
cglcol="grey", cglty=1, axislabcol="grey", caxislabels=seq(0,100,25), cglwd=0.4, calcex = 0.4,
#custom labels
vlcex=0.4, title = "Cluster 4", cex.main = 1 )
par(op)The cluster analysis show very interesting results. Our intuition from the regression analysis is confirmed as the success rate varies slightly between the four cluster. The highest achieving establishment are the one offering a canteen for lunch where the children stay at school playing with their friends. The lowest success rate is the group going home for lunch where each children is separated and each reality greatly differs. Indeed, some students will have a comforting environment at home whereas other will have a hard time at home or at someone else’s home.
The difference is more significant for some of the distinctions. The achieving rate for the distinction AB and B are rather similar across clusters. Student from an establishment of the fourth cluster are the one achieving the best results with 29.3% of them receiving the distinction TB and only 20.8% not receiving any distinction. The rates are much worse if your school has a boarding offer as 35% of students do not receive any distinction and only 10% get the TB distinction. This could be due that some of them are establishment for elite athletes which start to focus more and more on their sports.
To perform the cluster analysis, we filtered the session 2021 and
selected the variables department_fr, session,
school_id, external_students_rate,
half_boarders_students_rate,
boarding_students_rate from the joined data set. We then
removed th first three and scaled the rest of the data.
The goal of this cluster analysis is to define clusters relative to the offering of each establishment. Through this analysis, we aim to split the establishments by offering. As we have 3 different offering it would not make sense to do too many clusters. To define the right number of cluster for our kmeans clustering, we used the elbow method. For 2020, the deicision is not as clear as for 2021 and the hesitation is between three and four clusters. As we have four clusters in 2021, we decided to also take four cluster for the 2020 aalysis. The ratio between the between sum of square and the within sum of square is good at 81.1%.
fviz_nbclust(clust, kmeans, method = "wss") +
geom_vline(xintercept = 4, linetype = 2) + #add line for better visualisation
labs(subtitle = "Elbow method") #We can determine the optimal number of cluster, 4 clusters seems to be reasonable
# Compute k-means
km.res <- kmeans(clust, 4, nstart = 25)
# Visualize clusters using factoextra
fviz_cluster(km.res, clust,
ggtheme = theme_ipsum(),
repel = TRUE)+
scale_fill_viridis(discrete = TRUE)+
scale_color_viridis(discrete = TRUE)We can see that the four clusters are well defined with cluster one, three and four being differiated by the x-axis and cluster two spanning wider on the x-axis and on the y-axis. One can see what the dimensions represent in the graph below. Cluster three and four have higher half boarder rate than group two and especially group one. Group one will represent establishment without any catering offerings for the students. Group two is establishments with at least some of the students in boarding schools.
fviz_pca_var(PCA(clust, graph = FALSE))To be able to continue our analyisis on the cluster we just measured, we need to implement the data in the main data set. We gathered the cluster data in a new data frame then added the column of cluster to the data set used for the cluster and then joined it with inner_join to the main data set to keep the filter we had applied.
clust1 <- tibble(department_fr = names(km.res$cluster),
cluster = km.res$cluster)
clust_housing_dnb$cluster <- clust1$cluster
housing_dnb_2020 <- inner_join(x = housing_dnb, y = clust_housing_dnb)To analyse the data by cluster we needed to summarise it by cluster.
You can see the results in the table below with an added variable
establishment which is the number of establishment present
in each cluster.
clust_h_dnb_2020 <- housing_dnb_2020 %>%
group_by(cluster) %>%
summarise(establishment = n(),
external_students_rate = mean(external_students_rate, na.rm = TRUE),
half_boarders_students_rate = mean(half_boarders_students_rate, na.rm = TRUE),
boarding_students_rate = mean(boarding_students_rate, na.rm = TRUE),
without_pct = mean(without_pct, na.rm = TRUE),
AB_pct = mean(AB_pct, na.rm = TRUE),
B_pct = mean(B_pct, na.rm = TRUE),
TB_pct = mean(TB_pct, na.rm = TRUE),
success_rate_pct = mean(success_rate_pct, na.rm = TRUE)) %>%
round(digits = 2)
datatable(clust_h_dnb_2020, options =list(scrollX = "300px"))We visually represented the results in a spider chart using the radarchart function.
op <- par(mar=c(0, 0, 0, 0))
radar <- clust_h_dnb_2020 %>%
select(-c(cluster, establishment))
# To use the fmsb package, I have to add 2 lines to the dataframe: the max and min of each variable to show on the plot!
radar <- rbind(rep(100,50) , rep(0,50) , radar)
# plot the radar chart with the right color from viridis
radarchart(radar, axistype=1 ,
#custom polygon
pcol= viridis(4, alpha = 1) , pfcol= viridis(4, alpha = 0.2) , plwd=1 , plty=1,
#custom the grid
cglcol="grey", cglty=1, axislabcol="grey", caxislabels=seq(0,100,25), cglwd=0.4, calcex = 0.4,
#custom labels
vlcex=0.4)#, title = "Radar Graph")
# Add a legend
legend(x=1.2, y=-0.4, legend = c("Cluster 1", "Cluster 2", "Cluster 3", "Cluster 4"), bty = "n", pch=20 , col=viridis(4) , text.col = "grey", cex=0.4, pt.cex=1)
par(op)For better clarity, we have also displayed each cluster individually.
# we need to set the margin and create two rows to display the graphs
op <- par(mar=c(0, 1, 1, 0),mfrow=c(2, 2))
##### Cluster 1
#filter cluster 1
radar <- clust_h_dnb_2020 %>%
filter(cluster == 1)%>%
select(-c(cluster, establishment))
# To use the fmsb package, I have to add 2 lines to the dataframe: the max and min of each variable to show on the plot!
radar <- rbind(rep(100,50) , rep(0,50) , radar)
# plot the radar chart with the right color from viridis
radarchart(radar, axistype=1 ,
#custom polygon
pcol= rgb( red = 0.267, green = 0.00392, blue = 0.329, alpha = 0.5) , pfcol= rgb( red = 0.267, green = 0.00392, blue = 0.329, alpha = 0.2) , plwd=1 , plty=1,
#custom the grid
cglcol="grey", cglty=1, axislabcol="grey", caxislabels=seq(0,100,25), cglwd=0.4, calcex = 0.4,
#custom labels
vlcex=0.4, title = "Cluster 1", cex.main = 1 )
##### Cluster 2
radar <- clust_h_dnb_2020 %>%
filter(cluster == 2) %>%
select(-c(cluster, establishment))
# To use the fmsb package, I have to add 2 lines to the dataframe: the max and min of each variable to show on the plot!
radar <- rbind(rep(100,50) , rep(0,50) , radar)
# plot the radar chart with the right color from viridis
radarchart(radar, axistype=1 ,
#custom polygon
pcol= rgb( red = 0.192 , green = 0.40784, blue = 0.557, alpha = 0.5) , pfcol= rgb( red = 0.192 , green = 0.40784, blue = 0.557, alpha = 0.5) , plwd=1 , plty=1,
#custom the grid
cglcol="grey", cglty=1, axislabcol="grey", caxislabels=seq(0,100,25), cglwd=0.4, calcex = 0.4,
#custom labels
vlcex=0.4, title = "Cluster 2", cex.main = 1 )
##### Cluster 3
radar <- clust_h_dnb_2020 %>%
filter(cluster == 3) %>%
select(-c(cluster, establishment))
# To use the fmsb package, I have to add 2 lines to the dataframe: the max and min of each variable to show on the plot!
radar <- rbind(rep(100,50) , rep(0,50) , radar)
# plot the radar chart with the right color from viridis
radarchart(radar, axistype=1 ,
#custom polygon
pcol= rgb( red = 0.208 , green = 0.71765, blue = 0.475, alpha = 0.5) , pfcol= rgb( red = 0.208 , green = 0.71765, blue = 0.475, alpha = 0.5) , plwd=1 , plty=1,
#custom the grid
cglcol="grey", cglty=1, axislabcol="grey", caxislabels=seq(0,100,25), cglwd=0.4, calcex = 0.4,
#custom labels
vlcex=0.4, title = "Cluster 3", cex.main = 1 )
##### Cluster 4
radar <- clust_h_dnb_2020 %>%
filter(cluster == 4) %>%
select(-c(cluster, establishment))
# To use the fmsb package, I have to add 2 lines to the dataframe: the max and min of each variable to show on the plot!
radar <- rbind(rep(100,50) , rep(0,50) , radar)
# plot the radar chart with the right color from viridis
radarchart(radar, axistype=1 ,
#custom polygon
pcol= rgb( red = 0.992 , green = 0.90588, blue = 0.145, alpha = 0.5), pfcol= rgb( red = 0.992 , green = 0.90588, blue = 0.145, alpha = 0.5), plwd=1 , plty=1,
#custom the grid
cglcol="grey", cglty=1, axislabcol="grey", caxislabels=seq(0,100,25), cglwd=0.4, calcex = 0.4,
#custom labels
vlcex=0.4, title = "Cluster 4", cex.main = 1 )
par(op)The cluster analysis show very interesting results. Our intuition from the regression analysis is confirmed as the success rate varies between the four cluster. The highest achieving establishment are the one offering a canteen for lunch where the children stay at school playing with their friends. The lowest success rate is the group going home for lunch where each children is separated and each reality greatly differs. Indeed, some students will have a comforting environment at home whereas other will have a hard time at home or at someone else’s home. The difference is also quite marked for some of the distinctions. The achieving rate for the distinction AB and B are rather similar across clusters. Student from an establishment of the first cluster are the one achieving the best results with 33.3% of them receiving the distinction TB and only 17.9% not receiving any distinction. The rates are much worse if your school has a boarding offer as 26.4% of students do not receive any distinction and only 14.7% get the TB distinction. This could be due that some of them are establishment for elite athletes which start to focus more and more on their sports.
From this analysis, we can conclude that having lunch and sleeping at school or at home does have a small influence on the results of the Dimplome National du Brevet. Better results are achieved students in establishment offering the lunch. Going home for lunch does seem to hinder academic success. Boarding schools might not the best choice for students aiming for top honours.
The first analysis is at the national level.We want to use two plot types: a bar plot that represent total single parent families and a line plot to represent the success rate. Both plots use session for x-axis. Since the range of the success_rate plot is much lower than the single parent families plot, the use of a secondary y-axis with an adapted scale is recommended.
sp <- single_parent%>%
select("session", "department_fr", "sing_par")
singpar_vs_dnb <- left_join(dnb_results_dep, sp, by = c("department_fr","session"))
singpar_vs_dnb%>%
select(success_rate_pct, session, sing_par, department_fr)%>%
group_by(session)%>%
summarise(success_rate_pct = mean(success_rate_pct),
sing_par = sum (sing_par, na.rm = TRUE)) %>%
ggplot()+
geom_col(aes(x = session, y = sing_par)) +
geom_line(aes(x = session, y = 30000*success_rate_pct), size = 1, color="blue", group = 1) +
scale_y_continuous(sec.axis = sec_axis(~./30000, name = "Success rate")) +
labs( x = "Session", y = "Single parent families" )+
theme_ipsum()
Despite the constant increase in the number of parental families over
time, DNB’s results do not appear to be influenced too much by these.
Indeed, logically, the more the number of single parent families
increase, the more the success rate should decrease. However, from 2007
to 2017, sing_par and success_rate both increased, and from 2017 to
2019, although sing_par continued to increase, the success_rate suffered
a drop from 88.74% to 86.14%.
We now want to do an analysis of single parent families against the results of DNB at the departmental level. We will therefore do a linear regression and use singpar_vs_dnb, a dataframe showing the results of DNB and sing_par par department_fr.
ggplot(data = singpar_vs_dnb,
mapping = aes(x = sing_par, y = success_rate_pct)) +
labs(title = "Single-parent families VS success rate", x = "sing_par", y = "Success rate", ) +
geom_point() +
geom_smooth(method = lm,
color = "blue",
size = 0.3)+
theme_ipsum()
#> Warning: Removed 960 rows containing non-finite values
#> (`stat_smooth()`).
#> Warning: Removed 960 rows containing missing values (`geom_point()`).
#> Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x,
#> x$y, : font family not found in Windows font database
#> Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x,
#> x$y, : font family not found in Windows font database
#> Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x,
#> x$y, : font family not found in Windows font database
#> Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x,
#> x$y, : font family not found in Windows font database
#> Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x,
#> x$y, : font family not found in Windows font database
#> Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x,
#> x$y, : font family not found in Windows font database
#> Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x,
#> x$y, : font family not found in Windows font database
#> Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x,
#> x$y, : font family not found in Windows font database
#> Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x,
#> x$y, : font family not found in Windows font database
#> Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x,
#> x$y, : font family not found in Windows font database
#> Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x,
#> x$y, : font family not found in Windows font database
#> Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x,
#> x$y, : font family not found in Windows font database
regression_sp <- lm(data = singpar_vs_dnb, success_rate_pct ~ sing_par)
tab_model(regression_sp)| success_rate_pct | |||
|---|---|---|---|
| Predictors | Estimates | CI | p |
| (Intercept) | 85.68 | 85.24 – 86.12 | <0.001 |
| sing par | -0.00 | -0.00 – -0.00 | 0.020 |
| Observations | 576 | ||
| R2 / R2 adjusted | 0.009 / 0.008 | ||
The 34 missing values are the overseas departments, which we did not
have the information for success_rate. Rsquared is equal to 0.009, and
p_value is equal to 0.0203 but according to the significance code, it is
not significant. Since variance is not explained, the model does not fit
our data. Therefore, success rate of a department does not depend on the
number of single parent families in the department.
In order to observe the influence of Covid on student performance, we are going to investigate the relationship between incidence rate and success rate of DNB in every department in France. To do so, we first had to do a left join between covid_in_school and dnb_results in order to create a new data frame called covid_vs_dnb, and then create a simple linear regression model.
c <- c("2020", "2021")
d<- dnb_results%>%
select("session", "department_fr", "success_rate_pct")%>%
filter(session %in% c)%>%
group_by(department_fr, session)%>%
summarise(success_rate_pct = mean(success_rate_pct))
co<- covid_in_schools%>%
select("session", "department_fr", "incidence_rate")%>%
group_by(department_fr, session)%>%
summarise(incidence_rate = mean(incidence_rate))
covid_vs_dnb <- inner_join(d, co, by = c("session", "department_fr"))ggplot(data = covid_vs_dnb,
mapping = aes(x = incidence_rate, y = success_rate_pct)) +
labs(title = "incidence-rate of Covid cases VS success rate", x = "Incidence-rate", y = "Success rate", ) +
geom_point() +
geom_smooth(method = lm,
color = "blue",
size = 0.3)
regression <- lm(data = covid_vs_dnb, success_rate_pct ~ incidence_rate)
tab_model(regression)| success_rate_pct | |||
|---|---|---|---|
| Predictors | Estimates | CI | p |
| (Intercept) | 90.12 | 89.71 – 90.53 | <0.001 |
| incidence rate | -0.01 | -0.02 – -0.01 | <0.001 |
| Observations | 192 | ||
| R2 / R2 adjusted | 0.220 / 0.216 | ||
According to the residual results, the regression is more precise for
the high values than the low values. The p-value is close to 0,
therefore significant. The model seems usable. However, R squared is
around 0.2, meaning that it explains only 20% of the variation of the
success rate. Therefore, the result here are not very conclusive.
We then tried to do a linear regression for 2020 and 2021 separately because we thought that the 2020 low values could have distorted our model.
covid_vs_dnb2020 <- inner_join(d, co, by = c("session", "department_fr"))%>%
filter(session ==2020)
ggplot(data = covid_vs_dnb2020,
mapping = aes(x = incidence_rate, y = success_rate_pct)) +
labs(title = "incidence-rate of Covid cases VS success rate", x = "Incidence-rate", y = "Success rate", ) +
geom_point() +
geom_smooth(method = lm,
color = "blue",
size = 0.3)
regression2020 <- lm(data = covid_vs_dnb2020, success_rate_pct ~ incidence_rate)
tab_model(regression2020)| success_rate_pct | |||
|---|---|---|---|
| Predictors | Estimates | CI | p |
| (Intercept) | 90.12 | 89.60 – 90.64 | <0.001 |
| incidence rate | 0.01 | -0.17 – 0.20 | 0.880 |
| Observations | 96 | ||
| R2 / R2 adjusted | 0.000 / -0.010 | ||
covid_vs_dnb2021 <- inner_join(d, co, by = c("session", "department_fr"))%>%
filter(session ==2021)
ggplot(data = covid_vs_dnb2021,
mapping = aes(x = incidence_rate, y = success_rate_pct)) +
labs(title = "incidence-rate of Covid cases VS success rate", x = "Incidence-rate", y = "Success rate", ) +
geom_point() +
geom_smooth(method = lm,
color = "blue",
size = 0.3)
regression2021 <- lm(data = covid_vs_dnb2021, success_rate_pct ~ incidence_rate)
tab_model(regression2021)| success_rate_pct | |||
|---|---|---|---|
| Predictors | Estimates | CI | p |
| (Intercept) | 89.51 | 88.00 – 91.02 | <0.001 |
| incidence rate | -0.01 | -0.02 – -0.00 | 0.046 |
| Observations | 96 | ||
| R2 / R2 adjusted | 0.042 / 0.031 | ||
From these regression results, we see that we cannot use these models to explain the influence of the incidence rate and the success rate. Consequently, covid seems to not have a big impact on the dnb results. This finding is not so surprising because of the implementation of continuous assessment. In fact, according to the article by l’Etudiant (Cojean, 2020), the French government has taken special measures for DNB exams following Covid: since 2020, graduates were awarded on the basis of continuous assessment, meaning no final exam.
Since our linear regressions were inconclusive, the next step would be to observe whether or not covid influenced the attribution of the distinction. It can be hypothesized that given the continuous assessment, the rating has been more lenient. This would therefore lead to better results, thus more students admitted with mentions. In view of the results for 2020, which did not vary greatly, we will mainly focus on the attribution of distinction in 2021.To do so, we created covid_vs_dnbmention.
m<- dnb_results%>%
select("session", "department_fr", "TB_pct", "B_pct", "AB_pct", "admitted_without")%>%
filter(session == 2021)%>%
group_by(department_fr, session)%>%
summarise(TB_pct=mean(TB_pct), B_pct=mean(B_pct), AB_pct=mean(AB_pct), admitted_without=mean(admitted_without))
co<- covid_in_schools%>%
select("session", "department_fr", "incidence_rate")%>%
group_by(department_fr, session)%>%
summarise(incidence_rate = mean(incidence_rate))
covid_vs_dnbmention <- inner_join(m, co, by = c("session", "department_fr"))ggplot(data = covid_vs_dnbmention,
mapping = aes(x = incidence_rate, y = TB_pct)) +
labs(title = 'Incidence-rate of Covid cases VS Distinction "Très Bien"', x = "Incidence-rate", y = "Percentage of TB", ) +
geom_point() +
geom_smooth(method = lm,
color = "blue",
size = 0.3)
#> Warning: Removed 2 rows containing non-finite values
#> (`stat_smooth()`).
#> Warning: Removed 2 rows containing missing values (`geom_point()`).
regression <- lm(data = covid_vs_dnbmention, TB_pct ~ incidence_rate)
tab_model(regression)| TB_pct | |||
|---|---|---|---|
| Predictors | Estimates | CI | p |
| (Intercept) | 20.21 | 17.70 – 22.73 | <0.001 |
| incidence rate | 0.02 | 0.00 – 0.04 | 0.016 |
| Observations | 94 | ||
| R2 / R2 adjusted | 0.062 / 0.052 | ||
ggplot(data = covid_vs_dnbmention,
mapping = aes(x = incidence_rate, y = B_pct)) +
labs(title = 'Incidence-rate of Covid cases VS Distinction "Bien"', x = "Incidence-rate", y = "Percentage of B", ) +
geom_point() +
geom_smooth(method = lm,
color = "blue",
size = 0.3)
#> Warning: Removed 2 rows containing non-finite values
#> (`stat_smooth()`).
#> Warning: Removed 2 rows containing missing values (`geom_point()`).
regression <- lm(data = covid_vs_dnbmention, B_pct ~ incidence_rate)
tab_model(regression)| B_pct | |||
|---|---|---|---|
| Predictors | Estimates | CI | p |
| (Intercept) | 25.68 | 24.36 – 27.00 | <0.001 |
| incidence rate | -0.00 | -0.01 – 0.00 | 0.281 |
| Observations | 94 | ||
| R2 / R2 adjusted | 0.013 / 0.002 | ||
ggplot(data = covid_vs_dnbmention,
mapping = aes(x = incidence_rate, y = AB_pct)) +
labs(title = 'incidence-rate of Covid cases VS Distinction "Assez Bien"', x = "Incidence-rate", y = "Percentage of AB", ) +
geom_point() +
geom_smooth(method = lm,
color = "blue",
size = 0.3)
#> Warning: Removed 2 rows containing non-finite values
#> (`stat_smooth()`).
#> Warning: Removed 2 rows containing missing values (`geom_point()`).
regression <- lm(data = covid_vs_dnbmention, AB_pct ~ incidence_rate)
tab_model(regression)| AB_pct | |||
|---|---|---|---|
| Predictors | Estimates | CI | p |
| (Intercept) | 28.59 | 27.14 – 30.05 | <0.001 |
| incidence rate | -0.02 | -0.03 – -0.01 | 0.002 |
| Observations | 94 | ||
| R2 / R2 adjusted | 0.099 / 0.089 | ||
ggplot(data = covid_vs_dnbmention,
mapping = aes(x = incidence_rate, y = admitted_without)) +
labs(title = "incidence-rate of Covid cases VS Admitted Without Distinction", x = "Incidence-rate", y = "Percentage of students admitted without distinction", ) +
geom_point() +
geom_smooth(method = lm,
color = "blue",
size = 0.3)
regression <- lm(data = covid_vs_dnbmention, admitted_without ~ incidence_rate )
tab_model(regression)| admitted_without | |||
|---|---|---|---|
| Predictors | Estimates | CI | p |
| (Intercept) | 9.87 | 6.89 – 12.85 | <0.001 |
| incidence rate | 0.05 | 0.03 – 0.07 | <0.001 |
| Observations | 96 | ||
| R2 / R2 adjusted | 0.204 / 0.196 | ||
In view of the various linear regressions, as covid cases increased, the academic government seemed to have granted more easily the highest distinction and without distinction. In fact, the linear regression of distinction TB and admitted_without indicate moderate positive correlations. This therefore leads to think that the grading was more lenient for the departments where covid was very present. Nevertheless, it is difficult to interpret these results as they are statistically not significant and all Rsquared are extremely poor.
Since the linear regression does not allow us to say much about the
influence of Covid on student success, we decided to do an analysis from
clusters and see if we can get anything interesting results out of it.
As for the cluster analysis performed for student_housing we first
selected the variables, success_rate_pctand
incidence_rate in this case. Then we found the right number
of clusters using the elbow method and then performed a kmean cluster
analysis.
cluster1<- covid_vs_dnb2021
cluster <- cluster1[-c(1,2)]
row.names(cluster) <- as.vector(t(cluster1[,1]))
cluster <- scale(cluster)
fviz_nbclust(cluster, kmeans, method = "wss") +
geom_vline(xintercept = 5, linetype = 2) +
labs(subtitle = "Elbow method")
km.res1 <- kmeans(cluster, 5, nstart = 25) #77.4%
fviz_cluster(km.res1, cluster,
ggtheme = theme_ipsum(),
repel = TRUE)+
scale_fill_viridis(discrete = TRUE)+
scale_color_viridis(discrete = TRUE)
We end up with five cluster with cluster 1 having the highest success
rate and some of the smallest incidence rate and cluster 3 having the
highest incidence rate some of the worst success rate. We will focus on
these two clusters as they are the main outliers. To do the analysis, we
included the clusters in the data set dnb_results_dep and summarized the
data by cluster.
clust1 <- tibble(department_fr = names(km.res1$cluster),
cluster = km.res1$cluster)
clust_cov_dnb <- inner_join(x = dnb_results_dep, y = clust1)
clust_cov_dnb <- clust_cov_dnb %>%
filter(cluster %in% c(1,3)) %>%
group_by(cluster, session) %>%
summarise(department = n(),
without_pct = mean(without_pct, na.rm = TRUE),
AB_pct = mean(AB_pct, na.rm = TRUE),
B_pct = mean(B_pct, na.rm = TRUE),
TB_pct = mean(TB_pct, na.rm = TRUE),
success_rate_pct = mean(success_rate_pct, na.rm = TRUE)) %>%
round(digits = 2)
datatable(clust_cov_dnb, options =list(scrollX = "300px"))The results of the cluster shows that cluster 1 is performing better with a lower incidence rate. Before concluding on the fact that a low incidence rate lead to better results, we needed to see if the difference in success rate was truly induced by the Covid 19 or if it was that these departments were generally performing better. For this comparison, we plotted all their results in two separate graphs but first we had to pivot the data for all variables into one column.
clust_cov_dnb <- clust_cov_dnb %>%
pivot_longer(c(contains("pct")),
names_to = "Mention_type",
values_to = "Rate")
p <- clust_cov_dnb %>%
filter(cluster == 1) %>%
ggplot(aes(x = session, y = Rate, group = Mention_type, color = Mention_type))+
geom_line()+
scale_color_viridis(discrete = TRUE) +
labs(title = "Cluster 1", x = "Session", y = "Rate in %" ) +
theme_ipsum()
ggplotly(p, tooltip = c("x" ,"y"))p <- clust_cov_dnb %>%
filter(cluster == 3) %>%
ggplot(aes(x = session, y = Rate, group = Mention_type, color = Mention_type))+
geom_line()+
scale_color_viridis(discrete = TRUE) +
labs(title = "Cluster 3", x = "Session", y = "Rate in %" ) +
theme_ipsum()
ggplotly(p, tooltip = c("x" ,"y"))We can see that the difference has not been induced by the Covid 19 as the success rate of cluster 1 has always been higher than the one of cluster 3. Moreover, we see the same pattern for the success rate as they both increased by around 5% in 2020 and slightly went down in 2021. Regarding the distinctions, one can see that it is fairly similar in terms of percentage and pattern apart from the students graduating without honours who used to be a bigger share of the admitted students in the departments belonging to cluster 3.
To conclude on the question regarding the influence of Covid 19 on the results of the Diplome national du Brevet, we have seen through linear regressions and a cluster analysis that Covid19 had clost to no impact the success rate and distinctions. We can assume that as the Covid and the exam were on a national level the government body adapted the scale to the results of the students.
This project allowed us to explore and analyse the results of the Diplome National du Brevet on the 2006-2021 period and discover that political decisions have a far grater impact than any other socioeconomic factor. Indeed, in 2017, we clearly saw the impact of the reform “learn better to succeed better” brought by the minister of education. The distribution of the distinctions went from 45% without, 30% AB, 20% B and 5% TB to around 25% for each. The impact of political actions is clear for the distinctions but not so much for the success rate. It increased steadily over the time period. the only evidence of the impact of a policy action on success rate can be seen in 2020 where due to the Covid 19 politics decided to cancel the final exam and grade using continuous assessment.
To come to the conclusion that political action had a bigger impact on the results of the Diplome National du Brevet than socioeconomic factors, we analysed the impact of three of them. The type of housing and lunch place was one of them. The data on student housing allowed us to discover that this factor had an impact on the distinctions the student were more likely to receive. Indeed, a student with the possibility of going to the canteen is more likely to pass with the highest hounours and less likely to have none than the average student. Conversely, students living in boarding schools will typically underperform. Surprisingly, the type of housing and lunch place has no impact on the success rate. We also explored the impact of sport through establishment labelled Generation 2024. We first saw that despite being created around the Paris Olympic, establishments form all over the country took part in this initiative. We could not see any link between a variation of the results and the labeling or not of an establishment. This probably comes from the fact that the label is built around promotion of sport in sport associations and not doing sport in school meaning that it has a smaller impact on the children. The last socio-economical factor that we analysed was the family situation through a data set about single parent families. An astonishing results was that the rate of single parent families per student admitted is stable across all french departments and remains nearly unchanged over our analysis period. The result of the analysis of a potential impact on the DNB results are that single families do not seem to impact students’ performance.
We also analysed the potential impact of the COVID 19 pandemic on the DNB results. In our exploration of the data we found out that the Covid19 crisis impacted more heavily the eastern part of France in 2021. However, the lock-down guidelines were nationalized meaning that all students were playing with the same rules. Therefore, we found no significant impact of the number of cases per department on the results of the Diplome National du Brevet. The only impact of Covid 19, as previously discussed, can be seen on a national level.
Our analysis has some clear limitations. For our analysis on single parent family we were only able to work on a departmental level and with data on the global population and not only on students passing the Diplome National du Brevet. This is problematic because it lead to some very large approximations at every level of the analysis. For the part on the impact of sport, we chose the data on the label Generation 2024. Labelled establishments have a certain duty to promote the Paris Olympics and sports in general but the objective of the label are quite large. It means that not all labelled establishments have been actively promoting the act of going out and practicing some sport on a regular basis. Having clearer objectives or using another label promoting sport activity could have lead to more conclusive results. In our analysis of the impact of Covid, we decided to take into account all cases during the year but not all patients reacted the same way leading to different impact on their performance. Moreover, the students were evaluated on the basis of continuous assesment, it would have been more conclusive to analyse the results of each assessment during the years than the final results.
For further researches, it would be interesting to analyse the impact of Covid and single parent families by establishment to have more definitive results. In the quest of finding socioeconomic factors influencing children one would have many more to analyse such as the wealth of the parents, the size of the house they live in, the number of sports clubs where they are registered, the number of hours they spend on screens, the number of kilometers traveled to go to schools. The list is clearly non-exhaustive and one could try to analyse any factor that could be related to the results of the DNB. Another very interesting research that should be done is an analysis comprising several parameters to identify the ones having the bigger impact. Such an analysis would allow the creation of a model to predict the results of the different estasblishments.
Bien dans ma ville (2022). Les départements les plus peuplés de France. Retrieved from https://www.bien-dans-ma-ville.fr/classement-departement-nb-habitant/
Cojean, T. (April 3, 2020). Coronavirus: le bac et le brevet 2020 notés en contrôle continu. l’Etudiant. Retrieved from https://www.letudiant.fr/bac/coronavirus-le-bac-et-le-brevet-notes-en-controle-continu.html
Vice-rectorat de Polynésie Francaise (2016) Mieux comprendre le collège 2016 https://www.monvr.pf/mieux-comprendre-le-college-2016/
Peretti, J-M (2004) Gestion des ressources humaines, Vuibert
Amadieu, F; Tricot, A (2022) Les facteurs psychologiques qui ont un effet sur la réussite des étudiants, https://doi.org/10.4000/apliut.5155
Richard, E. (january 2009) La migration pour études collégiales: Regards d’intervenants sur l’accueil et l’intégration des nouveaux étudiants